Merged
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -92,6 +92,9 @@ jobs:
- name: Run xref
run: rebar3 xref

- name: Lint docs
run: escript scripts/lint_doc_snippets.escript

# FreeBSD test using cross-platform action
test-freebsd:
name: FreeBSD 14 / Python ${{ matrix.python }}
20 changes: 20 additions & 0 deletions Makefile
@@ -0,0 +1,20 @@
.PHONY: all compile test lint-docs clean

all: compile

compile:
rebar3 compile

test:
rebar3 ct --readable=compact

# Validate fenced code blocks in README.md and docs/*.md.
# Erlang `py:Fn(...)` calls must reference a real export at the right
# arity; Python blocks must parse (IndentationError tolerated for
# tutorial fragments). Mark a block to skip with `<!-- skip-lint -->`
# on the line immediately above the opening fence.
lint-docs: compile
escript scripts/lint_doc_snippets.escript

clean:
rebar3 clean
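The `lint-docs` comment above pins down the checker's contract: walk the markdown, collect fenced blocks, honour `<!-- skip-lint -->` on the line immediately above an opening fence, and tolerate `IndentationError` in Python fragments. The real checker is `scripts/lint_doc_snippets.escript` (Erlang); the sketch below is a hypothetical Python rendering of the same scan, with all names illustrative:

```python
import re

# An opening or closing fence: three backticks plus an optional language tag.
FENCE = re.compile(r"^```(\w*)\s*$")

def collect_blocks(text):
    """Yield (lang, body, skipped) for each fenced block in markdown text.

    A block is skipped when the line immediately above its opening
    fence is exactly '<!-- skip-lint -->'.
    """
    lines = text.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        m = FENCE.match(lines[i])
        if m and m.group(1):  # opening fence with a language tag
            skipped = i > 0 and lines[i - 1].strip() == "<!-- skip-lint -->"
            body = []
            i += 1
            while i < len(lines) and not FENCE.match(lines[i]):
                body.append(lines[i])
                i += 1
            blocks.append((m.group(1), "\n".join(body), skipped))
        i += 1
    return blocks

def lint_python(body):
    """Return an error string for unparsable Python, or None if it passes."""
    try:
        compile(body, "<doc>", "exec")
    except IndentationError:
        pass  # tolerated: tutorial fragments are often mid-function
    except SyntaxError as exc:
        return str(exc)
    return None
```

The Erlang side of the rule (checking `py:Fn(...)` calls against real exports and arities) needs the compiled beam files, which is why the Makefile target depends on `compile`.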
9 changes: 6 additions & 3 deletions docs/migration.md
@@ -38,6 +38,7 @@ application:set_env(erlang_python, context_mode, owngil).

**`py:num_executors/0`** - Removed. Contexts now use per-context worker threads.

<!-- skip-lint -->
```erlang
%% v2.x - check executor count
N = py:num_executors().
@@ -254,6 +255,7 @@ N = py_context_router:num_contexts().
The function for non-blocking Python calls has been renamed to follow gen_server conventions:

**Before (v1.8.x):**
<!-- skip-lint -->
```erlang
Ref = py:call_async(math, factorial, [100]),
{ok, Result} = py:await(Ref).
@@ -355,6 +357,7 @@ For more sophisticated web framework integration, consider the [Reactor API](rea
The process-binding functions have been removed. The new architecture uses `py_context_router` for automatic scheduler-affinity routing.

**Before (v1.8.x):**
<!-- skip-lint -->
```erlang
ok = py:bind(),
ok = py:exec(<<"x = 42">>),
@@ -760,9 +763,9 @@ ImportError: module does not support subinterpreters
```

Options:
1. Use Python < 3.12 (falls back to multi_executor mode)
2. Check if the library has a subinterpreter-compatible version
3. Isolate the library usage to a single context
1. Use Python 3.12 or 3.13: the runtime falls back to `worker` mode (subinterpreters require Python 3.14+).
2. Check if the library has a subinterpreter-compatible version.
3. Isolate the library usage to a single context.

### Python 3.14: `erlang_loop_import_failed`

61 changes: 46 additions & 15 deletions docs/owngil_internals.md
@@ -425,22 +425,53 @@ class EchoProtocol(reactor.Protocol):

## Performance Characteristics

| Operation | Shared-GIL | OWN_GIL |
|-----------|-----------|---------|
| Operation | Worker (shared GIL) | OWN_GIL |
|-----------|--------------------|---------|
| Call overhead | ~2.5μs | ~10μs |
| Throughput (single) | 400K/s | 100K/s |
| Parallelism | None | True |
| Resource usage | Lower | Higher (1 pthread per context) |

Use OWN_GIL when:
- CPU-bound Python work that benefits from parallelism
- Long-running computations
- Need true concurrent Python execution

Use worker mode when:
- I/O-bound or short operations
- High call frequency
- Resource constraints
| Throughput (single context) | ~400K/s | ~100K/s |
| Parallelism (N contexts) | GIL-bound | Linear up to N cores |
| Resource usage | One pthread per context | One pthread + one subinterpreter per context |
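The table's figures can be sanity-checked with a back-of-envelope model. The helper below is an illustration of the scaling claim, not part of the library: the 2.5μs/10μs inputs come from the table above, while the core count and the treatment of worker mode as fully GIL-serialised are assumptions (free-threaded builds change the worker-mode picture).

```python
def projected_throughput(overhead_us, contexts=1, cores=8, parallel=True):
    """Rough calls/sec implied by per-call overhead alone.

    Worker mode (parallel=False) serialises on the main GIL, so extra
    contexts add no throughput; OWN_GIL scales linearly up to the core
    count. Real workloads add Python execution time on top of this.
    """
    per_context = 1_000_000 / overhead_us  # calls/sec for one context
    effective = min(contexts, cores) if parallel else 1
    return per_context * effective

# The single-context rows fall out directly:
# worker:  1e6 / 2.5 -> 400_000 calls/s
# OWN_GIL: 1e6 / 10  -> 100_000 calls/s
```

The crossover is visible immediately: four OWN_GIL contexts already out-run a single worker context on CPU-bound work, despite the 4x per-call penalty.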

## Pros and Cons

### Pros

- **True CPU parallelism.** Each context owns its GIL, so N contexts run on N cores at once. Worker mode serialises on the main GIL unless Python is built free-threaded (3.13t+).
- **Crash isolation.** A C-level fault in one subinterpreter leaves the others alive. Worker mode shares the main interpreter, so a corrupt module state can take everything down.
- **Clean namespace per context.** Each subinterpreter has its own `sys.modules`, so module-level state cannot bleed between contexts. Useful when running adversarial or untrusted code paths side by side.
- **Predictable scheduling.** Requests are dispatched via a mutex/condvar handoff, not dirty schedulers, so OWN_GIL contexts will not be starved by other dirty NIF traffic.

### Cons

- **Python 3.14+ only.** Earlier versions have C-extension global-state bugs (`_decimal`, `numpy`, etc.) that crash inside subinterpreters. See [cpython#106078](https://github.com/python/cpython/issues/106078).
- **Higher per-call latency.** ~4x the round-trip cost of worker mode (~10μs vs ~2.5μs) because every call crosses a mutex/condvar handoff to the dedicated thread.
- **Higher memory.** Each subinterpreter imports its own copy of every module. A 50 MB module set across 8 contexts is ~400 MB resident, not 50 MB.
- **C-extension compatibility is not universal.** Extensions must opt in via the multi-phase init protocol (PEP 489) and `Py_mod_multiple_interpreters`. Pure-Python and well-behaved C extensions work; older ones fail at import inside the subinterpreter.
- **No shared Python state.** Module globals, class definitions, and cached objects are per-interpreter. Use `py:state_store/2` (ETS-backed) or `erlang.send` for cross-context data.
- **Callback re-entry is restricted.** When Python in an OWN_GIL context calls `erlang.call`, the callback runs on a thread worker, not back on the OWN_GIL thread (which cannot suspend). Re-entrant Python -> Erlang -> *same* OWN_GIL context calls will not work; use a different context for the nested call, or use `erlang.async_call` from asyncio code.
- **Process-local envs do not span interpreters.** A `py_env_resource_t` is bound to the interpreter that created it. Reusing one across contexts returns `{error, env_wrong_interpreter}`.
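The memory bullet above is simple multiplication, but it is the number that most often surprises people when sizing a deployment. A quick budgeting helper; the 50 MB x 8 example is the doc's own figure, and ignoring per-interpreter overhead beyond the module set is a simplifying assumption:

```python
def resident_mb(module_set_mb, contexts, mode="owngil"):
    """Rough resident-memory estimate for module imports only.

    Worker mode shares one interpreter, so the module set is loaded
    once; OWN_GIL loads a private copy per subinterpreter, so usage
    grows linearly with context count.
    """
    copies = contexts if mode == "owngil" else 1
    return module_set_mb * copies
```

With a 50 MB module set, `resident_mb(50, 8)` gives the ~400 MB cited above, versus 50 MB for worker mode with any context count.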

### When to Use Each

Use **OWN_GIL** when:

- The workload is CPU-bound Python (ML inference, numpy/torch compute, parsing, codecs) and you want N-way parallelism per BEAM scheduler.
- You can pin the per-context memory budget and the modules in use are subinterpreter-safe.
- You are on Python 3.14+.

Use **worker** (default) when:

- You are on Python 3.12 or 3.13.
- Calls are short and frequent (every microsecond of overhead matters).
- You are running modules that are not subinterpreter-safe (some scientific stacks, older C extensions).
- You are already running free-threaded Python (3.13t+); worker mode gets parallelism for free without the per-interpreter memory cost.

### Common Pitfalls

- **Importing once is not enough.** Imports happen per subinterpreter. Pre-warming a worker context will not pre-warm the OWN_GIL contexts; do it inside each `py_context`.
- **Sharing Python objects across contexts.** Passing a `PyObject*` reference (via `py_state` or otherwise) between OWN_GIL contexts is undefined behaviour. Round-trip through Erlang terms or ETS-backed state.
- **Long-running tasks block the dispatcher.** A single OWN_GIL context processes one request at a time. If you have a 30-second compute job, parallelise across contexts; do not queue everything onto context 1.
- **Callback storms.** Heavy `erlang.call` use inside an OWN_GIL context routes to thread workers, which is fine, but the round-trip cost is then worker-style on top of OWN_GIL dispatch. For tight callback loops, prefer worker mode end-to-end.
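The "one request at a time" pitfall can be modelled with plain stdlib executors: each `ThreadPoolExecutor(max_workers=1)` below stands in for one OWN_GIL context's dispatch queue. This is a toy analogy, not the `py` API; the round-robin policy is one assumed strategy for spreading work rather than anything the router guarantees.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

class ContextPool:
    """Toy model of N OWN_GIL contexts.

    Each context processes one request at a time, so submissions are
    spread round-robin across N single-worker queues instead of
    piling everything onto context 1.
    """

    def __init__(self, n):
        self._queues = [ThreadPoolExecutor(max_workers=1) for _ in range(n)]
        self._rr = itertools.cycle(range(n))

    def submit(self, fn, *args):
        # Next context in rotation gets the job; a slow job only
        # delays requests queued behind it on the same context.
        return self._queues[next(self._rr)].submit(fn, *args)

    def shutdown(self):
        for q in self._queues:
            q.shutdown()
```

With four contexts, four 30-second compute jobs finish in roughly 30 seconds instead of two minutes; queued onto one context they serialise.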

## Benchmarking

8 changes: 6 additions & 2 deletions docs/scalability.md
@@ -108,6 +108,10 @@ Ctx = py:context(1),
- Higher memory usage (each interpreter loads modules separately)
- Some C extensions don't support subinterpreters
- Requires Python 3.14+
- Higher per-call latency (~4x worker)
- Callback re-entry to the same context is restricted (`erlang.call` from inside an OWN_GIL context routes to a thread worker, not back to that context)

For a fuller breakdown of OWN_GIL tradeoffs, common pitfalls, and a usage decision guide, see [OWN_GIL Internals: Pros and Cons](owngil_internals.md#pros-and-cons).

## Subinterpreter Architecture

@@ -144,7 +148,7 @@ Ctx = py:context(1),
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Each thread owns its interpreter's GIL (Py_GIL_OWN)
│ Each thread owns its GIL (PyInterpreterConfig_OWN_GIL)
│ No GIL contention between threads │
└─────────────────────────────────────────────────────────────────┘
```
@@ -155,7 +159,7 @@ Ctx = py:context(1),

**py_context_process**: Gen_server that owns a Python context reference and handles call/eval/exec operations.

**Subinterpreter Thread Pool (C)**: Manages N threads, each with its own Python subinterpreter created with `Py_NewInterpreterFromConfig()` and `Py_GIL_OWN`.
**Subinterpreter Thread Pool (C)**: Manages N threads, each with its own Python subinterpreter created with `Py_NewInterpreterFromConfig()` and `PyInterpreterConfig_OWN_GIL`.

### Request Flow

2 changes: 2 additions & 0 deletions docs/security.md
@@ -42,6 +42,7 @@ This provides defense-in-depth - even if Python code tries to import `os` or `su

When blocked operations are attempted, you'll see:

<!-- skip-lint -->
```python
>>> import subprocess
>>> subprocess.run(['ls'])
@@ -50,6 +51,7 @@ fork()/exec() would corrupt the Erlang runtime.
Use Erlang ports (open_port/2) for subprocess management.
```
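The defence-in-depth described above amounts to replacing process-spawning entry points before user code runs. A simplified sketch of the idea: the real enforcement happens below the interpreter (so it also covers C extensions), the function names patched here are only the obvious Python-level ones, and the message text is taken from the error shown above.

```python
import os
import subprocess

_BLOCK_MSG = ("fork()/exec() would corrupt the Erlang runtime. "
              "Use Erlang ports (open_port/2) for subprocess management.")

def _blocked(*_args, **_kwargs):
    raise RuntimeError(_BLOCK_MSG)

def install_guards():
    """Replace process-spawning entry points so attempts fail loudly.

    A pure-Python approximation of the runtime's guard: any call to a
    patched entry point raises instead of forking under the BEAM.
    """
    os.fork = _blocked
    os.execv = _blocked
    subprocess.run = _blocked
    subprocess.Popen = _blocked
```

A pure-Python guard like this is advisory only; code can re-import or reach the syscall via C, which is why the library enforces the block at a lower layer.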

<!-- skip-lint -->
```python
>>> import os
>>> os.fork()
```
17 changes: 11 additions & 6 deletions docs/shared-dict.md
@@ -279,16 +279,21 @@ ok = py:shared_dict_destroy(Session).
%% Create shared cache
{ok, Cache} = py:shared_dict_new(),

%% Python can populate the cache
%% Inject the handle into Python globals (py:exec/1 has no locals
%% argument, so we stash it via py:eval with a side effect).
{ok, _} = py:eval(
<<"(globals().__setitem__('_cache_handle', handle), None)[-1]">>,
#{handle => Cache}),

%% Python can now populate the cache
ok = py:exec(<<"
from erlang import SharedDict
cache = SharedDict(handle)
cache['computed'] = expensive_computation()
">>,
ok = py:eval(<<"1">>, #{<<"handle">> => Cache}),
cache = SharedDict(_cache_handle)
cache['computed'] = 42
">>),

%% Erlang can read cached values
CachedValue = py:shared_dict_get(Cache, <<"computed">>).
42 = py:shared_dict_get(Cache, <<"computed">>).
```

## See Also