bpo-31650: PEP 552 (Deterministic pycs) implementation#4575
bpo-31650: PEP 552 (Deterministic pycs) implementation#4575
Conversation
780bf48 to
cd28db3
Compare
| against the source at runtime to determine if the pyc needs to be | ||
| regenerated. | ||
|
|
||
| .. attribute:: UNCHECKED_HASH |
There was a problem hiding this comment.
The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.
It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.
There was a problem hiding this comment.
The text below merely says "a hash is included but not validated" but makes no mention of what is validated in this situation. the pyc-invalidation docs cover this in more detail. It seems worth repeating that this just means a hash is generated when generating a pyc but is not checked by default here.
Yeah, I was trying to funnel everything into import.rst. I will beef up this part a big, though.
It'd also be good to include a statement of when this might be useful (packaging distribution installed pre-generated pyc files where something else is guaranteeing they cannot go out of sync for example?). Though perhaps that description belongs in import.rst's pyc-invalidation section.
I put some commentary about why you might want this in the what's new? entry. Maybe, it should go somewhere more "permanent", too, though.
| if args.workers is not None: | ||
| args.workers = args.workers or None | ||
|
|
||
| ivl_mode = args.invalidation_mode.replace('-', '_').upper() |
There was a problem hiding this comment.
Since you use the names in choices= above to generate the mode name, perhaps use a similar reverse transformation to generate a (sorted/stable-ordered) list of names in the choices tuple above?
choices=sorted(mode.lower().replace('_', '-') for mode in args.invalidation_mode),or similar?
| def _validate_bytecode_header(data, source_stats=None, name=None, path=None): | ||
| """Validate the header of the passed-in bytecode against source_stats (if | ||
| given) and returning the bytecode that can be compiled by compile(). | ||
| def _classify_pyc(data, name, exc_details): |
There was a problem hiding this comment.
Can you document what the arguments are?
name is not obvious - could it be given a better variable name? is it the pyc file's basename or the full pathname?
exc_details is unobvious.
There was a problem hiding this comment.
A similar comment applies to all other new function def's below.
Lib/importlib/_bootstrap_external.py
Outdated
| return flags | ||
|
|
||
|
|
||
| def _validate_timestamp_pyc(data, source_stats, name, exc_details): |
There was a problem hiding this comment.
It seems suspicious that this raises no error when the source_stats dict-like object fails to contain mtime or size members.
If this is intentional, document why.
There was a problem hiding this comment.
Yeah, I'll clean this up.
|
|
||
| def _code_to_timestamp_pyc(code, mtime=0, source_size=0): | ||
| "Produce the data for a timestamp-based pyc." | ||
| data = bytearray(MAGIC_NUMBER) |
There was a problem hiding this comment.
It is unfortunate that the constant is called MAGIC_NUMBER when in fact it is a bytes object containing the binary magic data. Can it be renamed as this is "just" _bootstrap_external or would that break some API somewhere?
bytearray(MAGIC_NUMBER) looks suspicious if a reader naively assumes that MAGIC_NUMBER is an integer...
There was a problem hiding this comment.
Yeah, it turns out the integer version is actually called _RAW_MAGIC_NUMBER, which a priori sounds more likely to be a bytearray. oops
At any rate, I think it's correct that this constant is exposed as some bytes not a number, since the only valid operation is to write it into a file or compare it to some bytes from a file.
Lib/importlib/_bootstrap_external.py
Outdated
| source_stats=st, name=fullname, | ||
| path=bytecode_path) | ||
| flags = _classify_pyc(data, fullname, exc_details) | ||
| bytes_data = data[16:] |
There was a problem hiding this comment.
Consider using memoryview(data)[16:] to avoid a large copy assuming _compile_bytecode is happy with that (it appears to be, marshal.loads() accepts memoryview)
Lib/importlib/_bootstrap_external.py
Outdated
| 'path': path, | ||
| } | ||
| _classify_pyc(data, fullname, exc_details) | ||
| return _compile_bytecode(data[16:], name=fullname, bytecode_path=path) |
There was a problem hiding this comment.
memoryview(data)[16:] if you also do this in the other place i made this comment.
| @@ -5,18 +5,25 @@ | |||
| from ._bootstrap import spec_from_loader | |||
| from ._bootstrap import _find_spec | |||
| from ._bootstrap_external import MAGIC_NUMBER | |||
There was a problem hiding this comment.
This is a public API exposing MAGIC_NUMBER so it's probably best to ignore my previous comment about changing the name of the thing to avoid confusion over its type.
Lib/modulefinder.py
Outdated
| self.msgout(2, "raise ImportError: " + str(exc), pathname) | ||
| raise | ||
| co = marshal.loads(marshal_data) | ||
| co = marshal.loads(data[16:]) |
Lib/py_compile.py
Outdated
| bytecode = importlib._bootstrap_external._code_to_hash_pyc( | ||
| code, | ||
| source_hash, | ||
| invalidation_mode == PycInvalidationMode.CHECKED_HASH, |
There was a problem hiding this comment.
A == test to get a boolean value in a list of args is easy to misread as a typo for a keyword argument being passed. Disambiguate by adding ()s around it or passing this by name?
benjaminp
left a comment
There was a problem hiding this comment.
Thanks for the comments! Update incoming.
cd28db3 to
737f925
Compare
brettcannon
left a comment
There was a problem hiding this comment.
Pretty much all of my comments are minor, so no need to have to double-check any changes made based on my feedback.
Doc/library/compileall.rst
Outdated
| .. cmdoption:: --invalidation-mode [timestamp|checked-hash|unchecked-hash] | ||
|
|
||
| Control how the generated pycs will be invalidated at runtime. The default | ||
| setting, ``timestamp``, means that pyc files with the source timestamp and |
There was a problem hiding this comment.
I just wanted to double-check that we are referring to the files as "pyc" and "pycs" throughout the rest of the docs instead of ".pyc files" or something? Basically just checking we're being consistent.
Doc/library/importlib.rst
Outdated
|
|
||
| .. function:: source_hash(source_bytes) | ||
|
|
||
| Return the hash of *source_bytes* as byte string. A hash-based pyc embeds the |
There was a problem hiding this comment.
"as byte string" -> "as bytes".
Doc/reference/import.rst
Outdated
| metadata. | ||
|
|
||
| Python also supports "hash-based" cache files, which store a hash of a source | ||
| file contents rather than its metadata. There are two variants of hash-based |
There was a problem hiding this comment.
"source file contents" -> "source file's contents"
Doc/reference/import.rst
Outdated
| hash in the cache file. If a checked hash-based cache file is found to be | ||
| invalid, Python regenerates it and writes a new checked hash-based cache | ||
| file. For unchecked hash-based pycs, Python simply assumes the cache file is | ||
| valid if it exists. Hash-based pyc validation behavior may be override with the |
Lib/importlib/_bootstrap_external.py
Outdated
| elif len(raw_timestamp) != 4: | ||
| message = 'reached EOF while reading timestamp in {!r}'.format(name) | ||
| if len(data) < 16: | ||
| message = 'reached EOF while reading pyc header of {!r}'.format(name) |
There was a problem hiding this comment.
Might as well change this to an f-string.
Lib/importlib/_bootstrap_external.py
Outdated
| flags = _r_long(data[4:8]) | ||
| # Only the first two flags are defined. | ||
| if flags & ~0b11: | ||
| raise ImportError('invalid flags {!r} in {!r}'.format(flags, name), |
Lib/importlib/_bootstrap_external.py
Outdated
|
|
||
| """ | ||
| if _r_long(data[8:12]) != (source_mtime & 0xFFFFFFFF): | ||
| message = 'bytecode is stale for {!r}'.format(name) |
Lib/importlib/_bootstrap_external.py
Outdated
| raise ImportError(message, **exc_details) | ||
| if (source_size is not None and | ||
| _r_long(data[12:16]) != (source_size & 0xFFFFFFFF)): | ||
| raise ImportError('bytecode is stale for {!r}'.format(name), |
| with util.create_modules('_temp') as mapping: | ||
| bc_path = self.manipulate_bytecode('_temp', mapping, | ||
| lambda bc: bc[:12], | ||
| lambda bc: bc[:16], |
There was a problem hiding this comment.
I should have made these constants instead of using a literal when I originally wrote this code. Sorry about that. :(
Python/import.c
Outdated
| if (d == NULL) | ||
| goto failure; | ||
| PyObject *pyc_mode = PyUnicode_FromString(_Py_CheckHashBasedPycsMode); | ||
| if (pyc_mode == NULL) |
737f925 to
b3a7070
Compare
gpshead
left a comment
There was a problem hiding this comment.
some minor comments to address but overall i believe this is good to go in when you are ready.
thanks!
Modules/main.c
Outdated
| -x : skip first line of source, allowing use of non-Unix forms of #!cmd\n\ | ||
| -X opt : set implementation-specific option\n\ | ||
| --check-hash-based-pycs always|default|never:\n\ | ||
| control how Python invalidates hash-based .pcy files\n\ |
Modules/perfmap.c
Outdated
| }; | ||
|
|
||
| PyDoc_STRVAR(module_doc, | ||
| "perfmap module."); |
There was a problem hiding this comment.
eek, now you know what crud I have sitting about in my checkout.
| if (flags != 0) { | ||
| // Hash-based pyc. We currently refuse to handle checked hash-based | ||
| // pycs. We could validate hash-based pycs against the source, but it | ||
| // seems likely that most people putting hash-based pycs in a zipfile |
Python now supports checking bytecode cache up-to-dateness with a hash of the source contents rather than volatile source metadata. See the PEP for details. While a fairly straightforward idea, quite a lot of code had to be modified due to the pervasiveness of pyc implementation details in the codebase. Changes in this commit include: - The core changes to importlib to understand how to read, validate, and regenerate hash-based pycs. - Support for generating hash-based pycs in py_compile and compileall. - Modifications to our siphash implementation to support passing a custom key. We then expose it to importlib through _imp. - Updates to all places in the interpreter, standard library, and tests that manually generate or parse pyc files to grok the new format. - Support in the interpreter command line code for long options like --check-hash-based-pycs. - Tests and documentation for all of the above.
e5d54da to
f498a91
Compare
|
Thanks again for the reviews, Brett and Greg! |
Python now supports checking bytecode cache up-to-dateness with a hash of the
source contents rather than volatile source metadata. See the PEP for details.
While a fairly straightforward idea, quite a lot of code had to be modified due
to the pervasiveness of pyc implementation details in the codebase. Changes in
this commit include:
The core changes to importlib to understand how to read, validate, and
regenerate hash-based pycs.
Support for generating hash-based pycs in py_compile and compileall.
Modifications to our siphash implementation to support passing a custom
key. We then expose it to importlib through _imp.
Updates to all places in the interpreter, standard library, and tests that
manually generate or parse pyc files to grok the new format.
Support in the interpreter command line code for long options like
--check-hash-based-pycs.
Tests and documentation for all of the above.
https://bugs.python.org/issue31650