BinaryOrNot 0.5.0: Zero Dependencies, 128 Bytes, One Trained Classifier

If you've ever had BinaryOrNot misidentify a UTF-16 file, choke on a CJK-encoded document, or crash because chardet changed its API, this release is for you.

This is the biggest release in BinaryOrNot's history. I rebuilt the detection engine from the ground up. The original used byte ratio heuristics with chardet as a second opinion for ambiguous files. I replaced all of that with a trained decision tree operating on 23 features, covering 49 binary formats and 37 text encodings, with zero external dependencies. It's backed by 211 tests and a training pipeline you can re-run yourself.

The chardet library (2.1 MB installed) is gone. In its place, the decision tree reads 128 bytes of a file and classifies it as binary or text using features computed from those bytes alone. The API is unchanged: is_binary("file.png") still returns True.
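To make "features computed from those bytes alone" concrete, here is a minimal sketch of what chunk-based feature extraction can look like. The function names and the three features are illustrative assumptions, not the library's actual 23 features or its internals, and the sketch assumes the 128 bytes are taken from the start of the file.

```python
CHUNK_SIZE = 128  # bytes read per file, per the release notes


def read_chunk(path: str, size: int = CHUNK_SIZE) -> bytes:
    """Return at most `size` bytes from the start of the file (assumed offset)."""
    with open(path, "rb") as handle:
        return handle.read(size)


def toy_features(chunk: bytes) -> dict[str, float]:
    """Compute three illustrative features; NOT the library's real feature set."""
    if not chunk:
        return {"null_ratio": 0.0, "control_ratio": 0.0, "high_ratio": 0.0}
    total = len(chunk)
    # Control bytes below 0x20, excluding tab, LF, and CR (NUL is included).
    control = sum(1 for b in chunk if b < 0x20 and b not in (0x09, 0x0A, 0x0D))
    return {
        "null_ratio": chunk.count(0) / total,
        "control_ratio": control / total,
        # Bytes with the high bit set: common in binary data and in
        # non-ASCII text encodings alike, so no single ratio is decisive.
        "high_ratio": sum(1 for b in chunk if b >= 0x80) / total,
    }
```

A classifier would then consume a dictionary like this one instead of the raw bytes, which is why the amount of data read can stay small.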

pip install --upgrade binaryornot

By the numbers

Before (0.4.4)                       | After (0.5.0)
-------------------------------------+--------------------------------
1 dependency (chardet, 2.1 MB)       | 0 dependencies
1024 bytes read per file             | 128 bytes read per file
Byte ratio heuristics + chardet      | Trained classifier, 23 features
~12 binary formats                   | 49 binary formats
ASCII + whatever chardet detected    | 37 text encodings
48 tests                             | 211 tests

What's new

  • CLI tool. Run binaryornot myfile.png from the command line and get True or False. Thanks @moluwole! (#49)

  • 49 binary formats recognized. PNG, JPEG, GIF, BMP, TIFF, ICO, WebP, PSD, HEIF, PDF, OLE2 (.doc/.xls), SQLite, ZIP, gzip, xz, bzip2, 7z, RAR, Zstandard, ELF, Mach-O, MZ/PE, Java class, WebAssembly, Dalvik DEX, RIFF, Ogg, FLAC, MP4/MOV, MP3, Matroska/WebM, MIDI, WOFF, WOFF2, OTF, TTF, EOT, Apache Parquet, .pyc, .DS_Store, LLVM bitcode, Git packfiles, and more. Every format cites its specification and is verified by magic-byte tests and real file fixtures.

  • 37 text encodings covered. UTF-8, UTF-16, UTF-32, all major single-byte encodings (ISO-8859, Windows code pages, KOI8-R, Mac encodings), and CJK encodings (GB2312, GBK, GB18030, Big5, Shift-JIS, EUC-JP, EUC-KR, ISO-2022-JP). A Big5-encoded Chinese document is correctly identified as text, not binary.

  • Encoding and format coverage tracked in CSVs. encodings.csv and binary_formats.csv are the single source of truth, feeding training data, parametrized tests, and documentation. Four gaps are documented with reasons (ISO-2022-KR and three EBCDIC code pages).

What's better

  • One-eighth the bytes read per file. The detector reads 128 bytes instead of 1024. The decision tree's features stabilize well within that range.

  • 211 tests, up from 48. The suite covers encoding round-trips, binary format magic bytes, real file fixtures for 16 formats, tiny-chunk edge cases, and boundary conditions. The decision tree itself is trained with balanced class weights and 5 targeted Hypothesis strategies (structured binary, binary with embedded strings, compressed binary, CJK text, whitespace-heavy text).

  • SQLite databases correctly detected as binary. Thanks @pombredanne! (#44)

  • Proper error logging for file I/O issues. Uses logger.exception() for better diagnostics when a file can't be read. Thanks @MarshalX! (#629)
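As a rough illustration of the logging change, the sketch below logs a read failure with logger.exception() and then re-raises. The read_start helper is a hypothetical name, and the log-then-re-raise shape is an assumption based on these notes rather than a copy of the library's code.

```python
import logging

logger = logging.getLogger(__name__)


def read_start(path: str, size: int = 128) -> bytes:
    """Read the start of a file, logging any I/O failure before re-raising."""
    try:
        with open(path, "rb") as handle:
            return handle.read(size)
    except OSError:
        # logger.exception() logs at ERROR level and appends the current
        # traceback, so the log shows which call failed and why.
        logger.exception("Could not read %s", path)
        raise
```

Because FileNotFoundError and PermissionError are both subclasses of OSError, a single except clause covers the two cases while the caller still receives the specific exception.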

What's fixed

  • chardet 7.0.0 crash (#634). chardet 7 returns {'encoding': None, 'confidence': 0.99}, which crashed is_binary_string() with a TypeError, then crashed the error handler with a NameError from a Python 2 unicode() call. Both crash paths are structurally impossible now because chardet is gone. Thanks @wesleybl for the report!

  • Unreadable files raise instead of returning False. is_binary() on a nonexistent or permission-denied file now raises FileNotFoundError or PermissionError. Previously it silently returned False, making broken paths indistinguishable from text files.
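The chardet 7 double crash is easier to follow in code. The sketch below is a hypothetical reconstruction of the failure shape described above, not the original source: legacy_decode is an invented name, and only the TypeError-then-NameError sequence comes from the report.

```python
def legacy_decode(chunk: bytes, detection: dict) -> str:
    """Decode using a detected encoding, with a Python 2 era fallback."""
    try:
        # With encoding=None, bytes.decode() raises TypeError on Python 3.
        return chunk.decode(encoding=detection["encoding"])
    except TypeError:
        # `unicode` is a Python 2 builtin. On Python 3 the name lookup fails,
        # so the handler raises NameError instead of recovering.
        return unicode(chunk, encoding=detection["encoding"])  # noqa: F821


if __name__ == "__main__":
    try:
        legacy_decode(b"abc", {"encoding": None, "confidence": 0.99})
    except NameError as error:
        # The original TypeError is preserved as the exception context.
        print(type(error).__name__, "caused while handling",
              type(error.__context__).__name__)
```

With a valid encoding the first branch succeeds and the fallback never runs, which is why the bug stayed hidden until a detection result arrived with encoding set to None.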

What's changed

  • Zero dependencies. pip install binaryornot installs nothing else. chardet is no longer needed.
  • Python 3.12+ only. Python 2 and older Python 3 versions are no longer supported. All Python 2 compatibility code has been removed.
  • MIT license (previously BSD).
  • src/ layout with hatchling build system, replacing setup.py/setup.cfg.

Contributors

@audreyfeldroy (Audrey M. Roy Greenfeld) designed and built this release: the trained decision tree, encoding and binary format coverage matrices, Hypothesis-based training pipeline, fixture generation, documentation, and the complete modernization from Cookiecutter PyPackage.

Thanks to @pombredanne (Philippe Ombredanne) for SQLite detection and binary stream improvements, @moluwole for the CLI tool, @MarshalX (Ilya Siamionau) for better error logging, @thebaptiste for pyproject.toml migration (#633), @wesleybl for reporting the chardet 7 crash (#634), @alcuin2 for binary detection improvements (#48), @olaoluwa-98 for CI updates (#50), and @cosmic-byte for test fixes (#52).