- Configure a `Changelog` link for `pypi` to display
- Upgrade `PyPDF2` 2.x to `pypdf` 5.0.1 (new name, same package)
- Add `--image-quality` option to `combine_pdfs` tool
- Add `--no-default-yara-rules` command line option so users can use only their own custom YARA rules files if they want. Previously you could only use custom YARA rules in addition to the default rules; now you can just skip the default rules.
- Add `combine_pdfs` command line script to merge a bunch of PDFs into one
- Remove unused `Deprecated` dependency
- Add `malware_MaldocinPDF` YARA rule
- Handle internal YARA errors more gracefully with error messages instead of crashes (currently seeing `ERROR_TOO_MANY_RE_FIBERS` on macOS on some files for unknown reasons that we hope will go away eventually)
- Bump `yaralyzer` version to 0.9.4 (and thus bump `yara-python` to 4.3.0+)
- Remove unused imports, remove unused `requirements.txt` file.
- Fix issue where additional YARA rules supplied with `--yara-file` option were not being used
- Update PDF matching YARA rules file (@anotherbridge)
- Bump version for pypi tag / shield image (@MartinThoma)
- Bump version for release
- Bump `yaralyzer` version to handle `yara-python` breaking change
- Fix export filename
- Add `--preview-stream-length` option
- Store parsed `args` on `PdfalyzerConfig` class
- Yaralyzer CLI options all configurable with env vars.
- Fix infinite loop bug encountered when building some char maps
- Add all the possible PDF internal commands that can lead to JavaScript execution or local/remote command execution to `DANGEROUS_PDF_KEYS` list.
- New `--extract-quoted` argument can be specified to have `yaralyzer` extract and decode all bytes between the specified quote chars.
- Quoted bytes are no longer force decoded by default.
- New `--suppress-boms` argument suppresses BOM search.
- Fix PER ENCODING METRICS subtable in decodings stats table
- Add percentage calculations to decoding attempts table
- `--log-level` option (from `yaralyzer`)
- Refactor `PdfTreeVerifier` and `IndeterminateNode` out of `Pdfalyzer` class
- Check for any explicit `/Kids` relationships when placing indeterminate nodes
- All other things being equal prefer a single `/Page` or `/Pages` referrer as the parent
- Rich table view displays object properties and referenced nodes with appropriate color and labeling
- Style `/Encoding` objects as part of the font family
- Refactor text coloring/styling to `pdfalyzer.output.styles` package
- Launchable with `python -m pdfalyzer` for those who can't get the `pdfalyze` script to work (h/t @MartinThoma)
- Last ditch attempt to place indeterminate nodes according to which node has most descendants catches almost everything
- Refactor `PdfalyzerPresenter` class to handle output formatting.
- Fix parent/child issue with `/Annots` arrays being indeterminate
- Fix issue with `/ColorSpace` node placement
- Add `sub_type` to node label
- Handle unsupported stream filters (e.g. `/JBIG2Decode`) more gracefully
- Suppress spurious warnings about multiple refs
- Handle edge case `/Resources` node placement
- Refactor `pdf_object_properties.py` decorator
- Show embedded streams table in `--docinfo` output
- Unify indeterminate node tree placement logic (`/Resources` are not special)
- Bump dependencies
- Fix regressions
- Fix issue when `/Resources` is referred to by multiple addresses from different nodes
- Scan all binaries (not just font binaries) with included PDF related YARA rules
- Better warning about stream decode failures
- Remove warnings that should not be warnings
- Refactor rich table view code to `pdf_node_rich_table.py`
- Refactor `Relationship` and `PdfObjectRef` to single class, `PdfObjectRelationship`
- Fix `importlib.resources` usage in case pdfalyzer is packaged as a zip file
- `/Names` is an indeterminate reference type
- Catch stream decode exceptions and show error instead of failing.
- Improve the handling of ColorSpace and Resources nodes
- Improve the handling of indeterminate and pure reference nodes (again)
- Improve the handling of indeterminate and pure reference nodes
- Fix bug with unescaped string in section header
- Fix bug with discovery of packaged `.yara` files
- More PDF YARA rules from `lprat`
- Bump deps
- Use `rich_argparse_plus` for help text
- `--streams` arg now takes an optional PDF object ID
- `--fonts` no longer takes an optional PDF object ID
- YARA matches will display more than 512 bytes
- Improved output formatting
- Scan all binary streams, not just fonts. Separate `--streams` option is provided. (`--font` option has much less output)
- Display MD5, SHA1, and SHA256 for all binary streams as well as overall file
- Highlight suspicious instructions in red, BOMs in green
- Reenable guillemet quote matching
- Clearer labeling of binary scan results
- Sync with `yaralyzer` v0.4.0
- Sync with `yaralyzer` v0.3.3
- Show defaults and valid values for command line options
- Add table of stream lengths for PDF objects containing streams to `--doc-info` output
- Quote extraction API methods should use yara, not bespoke extraction
- Fix bug with rich tree view of non binary streams
- Use `yaralyzer` as the match engine
- Scan all binary streams, not just the fonts
- Integrate YARA scanning - all the rules I could dig up relating to PDFs
- Add MD5, SHA1, SHA256 to document info section
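
As a rough illustration of the hash entries above, the three digests can be computed with Python's standard `hashlib` module. The function name `compute_digests` is hypothetical and is not taken from pdfalyzer's code; this is only a sketch of the general technique, not the tool's implementation.

```python
import hashlib


def compute_digests(data: bytes) -> dict[str, str]:
    """Return hex MD5, SHA1, and SHA256 digests for a byte string."""
    return {
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA1": hashlib.sha1(data).hexdigest(),
        "SHA256": hashlib.sha256(data).hexdigest(),
    }


if __name__ == "__main__":
    # Any bytes will do: a whole file's contents or a single decoded stream.
    for name, digest in compute_digests(b"%PDF-1.7").items():
        print(f"{name}: {digest}")
```

The same helper could be applied to a whole file's bytes or to each binary stream, which is how the two changelog entries relate.
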
- `pdfalyzer_show_color_theme` script shows the theme
- Make `README` more PyPI friendly
Bunch of small changes to support releasing on pypi
- Invoke with shell command `pdfalyze` instead of local python file `./pdfalyzer.py` (options are the same)
- Core class renames: `PdfWalker` -> `Pdfalyzer`, `DataStreamHandler` -> `BinaryScanner`
- Permanent env var configuration moved from a file called `.env` to a file called `.pdfalyzer`
- Logging to a file is off unless configured by env var
- To use Didier Stevens's `pdf-parser.py` you must provide the `PDFALYZER_PDF_PARSER_PY_PATH` env var
- Hexadecimal representation of matched bytes in decode attempts table
- `--quote-type` option to limit binary scans
- `--min-decode-length` option to skip decode attempts on short matches
- `--file-suffix` option
- Output filenames will contain some of the options used to generate them
- Add runtime params to export filenames where it is material to the output
- Ensure `/OpenAction` etc are not subsumed by parent/child relationships in the condensed tree view
- Tweak available configuration options for logging to file.
- Fix bug with validating directly embedded objects
- Improved scanning of binaries for `UTF-X` encoded data where X is not a prime number.
- Lots of summary data is now displayed about which encodings were most and least successful at extracting some meaning (or at least not failing) from binary sequences surrounded by quote chars, front slashes, backticks, etc.
- Will execute "by the book" decodes using normally untested encodings if the `chardet.detect()` library feels strongly enough about it.
- Exporting SVGs, HTML, and colored text can be done in a single invocation.
- Invocations of the tool are now logged in a history file `log/pdfalyzer.invocation.log`
- Logging to a file can be enabled by setting a `PDFALYZER_LOG_DIR` environment variable but see comments in `.pdfalyzer.example` about side effects.
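
The opt-in file logging described above follows a common pattern: file output stays off unless an environment variable names a directory. The sketch below shows that pattern only. The `PDFALYZER_LOG_DIR` name comes from the entry above, while `resolve_log_file` and the `pdfalyzer.log` filename are hypothetical and not taken from the tool's source.

```python
import logging
import os
from pathlib import Path
from typing import Mapping, Optional

LOG_DIR_ENV_VAR = "PDFALYZER_LOG_DIR"  # name taken from the changelog entry


def resolve_log_file(env: Mapping[str, str], filename: str = "pdfalyzer.log") -> Optional[Path]:
    """Return the log file path if the env var is set, else None (file logging off).

    The default filename is an assumption made for this sketch.
    """
    log_dir = env.get(LOG_DIR_ENV_VAR, "").strip()
    if not log_dir:
        return None
    return Path(log_dir) / filename


if __name__ == "__main__":
    log_file = resolve_log_file(os.environ)
    if log_file is None:
        # No directory configured: log to the console only.
        logging.basicConfig(level=logging.INFO)
    else:
        log_file.parent.mkdir(parents=True, exist_ok=True)
        logging.basicConfig(level=logging.INFO, filename=str(log_file))
    logging.getLogger(__name__).info("logging configured")
```

Passing the environment in as a mapping keeps the lookup easy to test without touching the real process environment.
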
- `--maximize-width` arg means you can set your monitor to teeny tiny fonts and print out absolutely monstrous SVGs (yay!)
- `--chardet-cutoff` option lets you control the cutoff for adding untested encodings to the output based on what `chardet.detect()` thinks is the right encoding
- `--suppress-chardet` command line option removes the chardet tables that are (mostly) duplicative of the decoded text tables
- `--output-dir` and `--file-prefix` are now shared by all the export modes
- You can use `dotenv` to permanently turn on or off or change the value of some command line options; see `.pdfalyzer.example` for details on what is configurable.
- Default `TerminalTheme` colors kind of sucked when you went to export SVGs and HTML... like black was not black, or even close. Things are simpler now - black is black, blue is blue, etc. Makes exports look better.
- Binary data highlighting now goes all the way to the end of the matched string in most cases (small bug had it falling 1-4 chars behind sometimes)
- Fix small bug with exporting font/binary details to SVGs
- Fix `Win-
- `BytesMatch` class to keep track of binary regex matches
- Group suppression notifications together
- Dramatic expansion in `pdfalyzer`'s binary data scouring capabilities:
  - Add `chardet` library guesses as to the encoding of all unknown byte sequences and rank them from most to least likely
  - Add attempted decodes of all backtick, frontslash, single, double, and guillemet quoted strings in font binaries
  - Add decode attempts with `Windows-1252`, `UTF-7`, and `UTF-16` encodings
  - Add `--suppress-decodes` to suppress attempted decodes of quoted strings in font binaries
  - Cool art gets generated when you swarm a binary's quoted strings, which are mostly but not totally random
- Add
- The `--font` option takes an optional argument to limit the output to a single font ID
- Add `--max-decode-length` to suppress attempted decodes of quoted strings in font binaries over a certain length
- Add `--surrounding` option to specify number of bytes to print/decode before and after suspicious bytes; decrease default number of surrounding bytes
- Add `--version` option
- `extract_guillemet_quoted_bytes()` and `extract_backtick_quoted_bytes()` are now iterators
- Fix scanning for `UTF-16` BOM in font binary
- Print unprintable ASCII characters in the font binary forced decode attempt with a bracket notation, e.g. print the string `[BACKSPACE]` instead of deleting the previous character
- Add an attempt to decode font binary as `latin-1` in addition to `utf-8`
- Highlight the suspicious bytes in the font binary forced decode attempts
- Fix printing of suspicious font bytes when suspicions are near start or end of stream
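
The bracket notation for unprintable characters mentioned above can be sketched roughly as follows. Only the `[BACKSPACE]` label appears in the changelog. The other labels, the hex fallback, and the function name `bracket_unprintable` are assumptions made for illustration and may not match what the tool actually prints.

```python
# Names for a few common control characters; anything else falls back to hex.
# Only BACKSPACE is taken from the changelog, the rest are illustrative.
CONTROL_CHAR_NAMES = {
    0x00: "NUL",
    0x08: "BACKSPACE",
    0x09: "TAB",
    0x0A: "NEWLINE",
    0x0D: "CARRIAGE_RETURN",
    0x1B: "ESC",
    0x7F: "DELETE",
}


def bracket_unprintable(text: str) -> str:
    """Replace ASCII control characters with bracketed labels."""
    pieces = []
    for char in text:
        code = ord(char)
        if code < 0x20 or code == 0x7F:
            label = CONTROL_CHAR_NAMES.get(code, f"0x{code:02X}")
            pieces.append(f"[{label}]")
        else:
            pieces.append(char)
    return "".join(pieces)


if __name__ == "__main__":
    raw = b"abc\x08d\x01"
    # latin-1 maps every byte to a code point, so this decode cannot fail.
    print(bracket_unprintable(raw.decode("latin-1")))
```

Rendering the label instead of the raw byte keeps a terminal from interpreting the control character, which is the problem the entry describes.
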
- Color `/Widths` tables
- Color `/Catalog` and other summary nodes with green
- Color `ByteStringObject` like bytes
- Resolve types of `IndirectObject` refs appearing in `dict` and `list` value Rich Tree table rows
- Remove redundant `/First` and `/Last` non tree refs when those relationships are part of the tree
- Couple of edge case bug fixes
- Fix issue with directly embedded `/Resources` not being walked correctly (along with their fonts)
- Introduce `PdfObjectRelationship` tuple to contain the root reference key, the actual reference address string, and the referenced obj
- Add warnings if any PDF objects are missing from the tree
- Initial release
- Change command line option style (capital letters for debugging, 3 letter codes for export)
- No need to explicitly call `walk_pdf()`
- Fix parent/child relationships for `/StructElem`