Ray: Remote Code Execution via Parquet Arrow Extension Type Deserialization
Ray Data registers custom Arrow extension types (ray.data.arrow_tensor, ray.data.arrow_tensor_v2, ray.data.arrow_variable_shaped_tensor) globally in PyArrow. When PyArrow reads a Parquet file containing one of these extension types, it calls arrow_ext_deserialize on the field's metadata bytes. Ray's implementation passes these bytes directly to cloudpickle.loads(), achieving arbitrary code execution during schema parsing, before any row data is read. In May 2024, Ray fixed a related vulnerability in PyExtensionType-based extension types (issue #41314, PR #45084). …