BinaryOrNot 0.5.0: Zero Dependencies, 128 Bytes, One Trained Classifier

If you've ever had BinaryOrNot misidentify a UTF-16 file, choke on a CJK-encoded document, or crash because chardet changed its API, this release is for you.

This is the biggest release in BinaryOrNot's history. I rebuilt the detection engine from the ground up. The original used byte ratio heuristics with chardet as a second opinion for ambiguous files. I replaced all of that with a trained decision tree operating on 23 features, covering 49 binary formats and 37 text encodings, with zero external dependencies. It's backed by 211 tests and a training pipeline you can re-run yourself.

In practice, the chardet library (2.1 MB installed) is gone. In its place, the decision tree reads 128 bytes of a file and classifies it as binary or text using features computed from those bytes alone. The API is unchanged: is_binary("file.png") still returns True.

pip install --upgrade binaryornot
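
To make "features computed from those bytes alone" concrete, here is a minimal sketch of byte-level feature extraction. The function name byte_features and the three ratios are assumptions chosen for illustration. They are not the 23 features the trained classifier uses, and this is not BinaryOrNot's code.

```python
CHUNK_SIZE = 128  # bytes examined per file, per the release notes


def byte_features(data):
    """Compute a few illustrative ratios over the first CHUNK_SIZE bytes.

    Hypothetical features for illustration only; the real classifier's
    23 features are not reproduced here.
    """
    chunk = data[:CHUNK_SIZE]
    n = len(chunk)
    if n == 0:
        return {"null_ratio": 0.0, "control_ratio": 0.0, "high_ratio": 0.0}

    nulls = chunk.count(0)
    # Control bytes other than NUL (counted above), tab, LF, and CR.
    controls = sum(
        1 for b in chunk if b < 0x20 and b not in (0x00, 0x09, 0x0A, 0x0D)
    )
    # Bytes outside the 7-bit ASCII range.
    high = sum(1 for b in chunk if b >= 0x80)

    return {
        "null_ratio": nulls / n,
        "control_ratio": controls / n,
        "high_ratio": high / n,
    }
```

Plain ASCII text scores zero on all three ratios, while "hi".encode("utf-16-le") is half NUL bytes. A single fixed threshold on any one ratio would misjudge one of those cases, which is the kind of decision a trained tree over many features is meant to handle.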

By the numbers

Before (0.4.4)	After (0.5.0)
1 dependency (chardet, 2.1 MB)	0 dependencies
1024 bytes read per file	128 bytes read per file
Byte ratio heuristics + chardet	Trained classifier, 23 features
~12 binary formats	49 binary formats
ASCII + whatever chardet detected	37 text encodings
48 tests	211 tests

What's new

CLI tool. Run binaryornot myfile.png from the command line and get True or False. Thanks @moluwole! (#49)
49 binary formats recognized. PNG, JPEG, GIF, BMP, TIFF, ICO, WebP, PSD, HEIF, PDF, OLE2 (.doc/.xls), SQLite, ZIP, gzip, xz, bzip2, 7z, RAR, Zstandard, ELF, Mach-O, MZ/PE, Java class, WebAssembly, Dalvik DEX, RIFF, Ogg, FLAC, MP4/MOV, MP3, Matroska/WebM, MIDI, WOFF, WOFF2, OTF, TTF, EOT, Apache Parquet, .pyc, .DS_Store, LLVM bitcode, Git packfiles, and more. Each format entry cites its specification and is verified by magic-byte tests, and 16 formats are additionally checked against real file fixtures.
37 text encodings covered. UTF-8, UTF-16, UTF-32, all major single-byte encodings (ISO-8859, Windows code pages, KOI8-R, Mac encodings), and CJK encodings (GB2312, GBK, GB18030, Big5, Shift-JIS, EUC-JP, EUC-KR, ISO-2022-JP). A Big5-encoded Chinese document is correctly identified as text, not binary.
Encoding and format coverage tracked in CSVs. encodings.csv and binary_formats.csv are the single source of truth, feeding training data, parametrized tests, and documentation. Four gaps are documented with reasons (ISO-2022-KR and three EBCDIC code pages).
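
For readers unfamiliar with magic bytes, here is a minimal sketch of signature matching for a handful of the formats listed above. The SIGNATURES table and the sniff_format helper are hypothetical names for illustration. BinaryOrNot 0.5.0 uses a trained classifier, so this shows the idea behind a magic-byte test, not the library's detection logic.

```python
# Leading signatures for a few well-known formats (a small, illustrative
# subset; the release covers 49 formats).
SIGNATURES = {
    "PNG": b"\x89PNG\r\n\x1a\n",
    "PDF": b"%PDF-",
    "ZIP": b"PK\x03\x04",
    "gzip": b"\x1f\x8b",
    "ELF": b"\x7fELF",
    "SQLite": b"SQLite format 3\x00",
}


def sniff_format(chunk):
    """Return the name of the first signature that `chunk` starts with.

    Hypothetical helper: returns None when no known signature matches,
    which says nothing by itself about whether the data is text.
    """
    for name, magic in SIGNATURES.items():
        if chunk.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    print(sniff_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
    print(sniff_format("中文".encode("big5")))
```

The Big5 bytes match no signature even though every one of them is outside the ASCII range, so a signature table alone cannot separate CJK text from unknown binary data.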

What's better

8x fewer bytes read per file. The detector reads 128 bytes instead of 1024, which is enough because the decision tree's features stabilize well within that range.
211 tests, up from 48. Encoding round-trips, binary format magic bytes, real file fixtures for 16 formats, tiny-chunk edge cases, and boundary conditions. The decision tree is trained with balanced class weights and 5 targeted Hypothesis strategies (structured binary, binary with embedded strings, compressed binary, CJK text, whitespace-heavy text).
SQLite databases correctly detected as binary. Thanks @pombredanne! (#44)
Proper error logging for file I/O issues. Uses logger.exception() for better diagnostics when a file can't be read. Thanks @MarshalX! (#629)

What's fixed

chardet 7.0.0 crash (#634). chardet 7 returns {'encoding': None, 'confidence': 0.99}, which crashed is_binary_string() with a TypeError, then crashed the error handler with a NameError from a Python 2 unicode() call. Both crash paths are structurally impossible now because chardet is gone. Thanks @wesleybl for the report!
Unreadable files raise instead of returning False. is_binary() on a nonexistent or permission-denied file now raises FileNotFoundError or PermissionError. Previously it silently returned False, making broken paths indistinguishable from text files.
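
The log-then-raise behavior described above can be sketched as follows. Both read_head and describe are hypothetical helpers written for this example, not BinaryOrNot functions. The sketch only shows the pattern of calling logger.exception() inside the except block and re-raising, and how a caller can then tell a broken path apart from a readable file.

```python
import logging

logger = logging.getLogger(__name__)


def read_head(path, size=128):
    """Return up to `size` bytes from the start of `path`.

    Hypothetical helper: logs the failure with its traceback, then
    re-raises instead of swallowing the error.
    """
    try:
        with open(path, "rb") as f:
            return f.read(size)
    except OSError:
        # logger.exception() logs at ERROR level and appends the traceback.
        logger.exception("Could not read %s", path)
        raise


def describe(path):
    """Caller-side handling: distinct outcomes instead of a silent False."""
    try:
        head = read_head(path)
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "permission denied"
    return f"read {len(head)} bytes"
```

Code that relied on the old silent False for bad paths needs a try/except like the one in describe after upgrading.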

What's changed

Zero dependencies. pip install binaryornot installs nothing else. chardet is no longer needed.
Python 3.12+ only. Python 2 and older Python 3 versions are no longer supported. All Python 2 compatibility code has been removed.
MIT license (previously BSD).
src/ layout with hatchling build system, replacing setup.py/setup.cfg.

Contributors

@audreyfeldroy (Audrey M. Roy Greenfeld) designed and built this release: the trained decision tree, encoding and binary format coverage matrices, Hypothesis-based training pipeline, fixture generation, documentation, and the complete modernization from Cookiecutter PyPackage.

Thanks to @pombredanne (Philippe Ombredanne) for SQLite detection and binary stream improvements, @moluwole for the CLI tool, @MarshalX (Ilya Siamionau) for better error logging, @thebaptiste for pyproject.toml migration (#633), @wesleybl for reporting the chardet 7 crash (#634), @alcuin2 for binary detection improvements (#48), @olaoluwa-98 for CI updates (#50), and @cosmic-byte for test fixes (#52).