Description
Feature Request / Improvement
Can we consolidate and standardize FileIO to the PyArrow implementation?
There are currently two different FileIO implementations, ARROW_FILE_IO and FSSPEC_FILE_IO. ARROW_FILE_IO uses Apache Arrow's Filesystem Interface while FSSPEC_FILE_IO uses the fsspec library.
Here are a few reasons for consolidating:
- PyArrow is already preferred over FsSpec for various FS implementations.

  iceberg-python/pyiceberg/io/__init__.py, lines 273 to 282 in cd7fb50:

  ```python
  SCHEMA_TO_FILE_IO: Dict[str, List[str]] = {
      "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
      "s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
      "s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
      "gs": [ARROW_FILE_IO],
      "file": [ARROW_FILE_IO],
      "hdfs": [ARROW_FILE_IO],
      "abfs": [FSSPEC_FILE_IO],
      "abfss": [FSSPEC_FILE_IO],
  }
  ```
- PyIceberg is becoming more coupled with PyArrow: `to_arrow()` and `pa.Table` are widely used for reading and writing, including the new feature `create_table` with a PyArrow Schema #305.
- Behavior is easier to keep consistent with a single implementation than by keeping the two FileIOs in sync. For example, FsSpec defaults a path with no scheme (`/tmp/warehouse`) to the `file` scheme, but PyArrow does not. See #301.
- The two FileIO implementations are not that different from one another. FsSpec can use its underlying FS implementations, including `LocalFileSystem`, `S3FileSystem`, `GCSFileSystem`, and `AzureBlobFileSystem`. PyArrow uses its own FS implementations, including `LocalFileSystem`, `S3FileSystem`, `HadoopFileSystem`, and `GcsFileSystem`. PyArrow is currently missing the `HadoopFileSystem` implementation but it has support for HDFS.
- FsSpec and PyArrow can be used bidirectionally:
  - PyArrow can use an fsspec-based filesystem.
  - FsSpec can wrap a PyArrow filesystem.
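
The scheme preference table and the scheme-less path behavior mentioned in the list can be illustrated with a small standalone sketch. `candidate_file_ios` is a hypothetical helper, the two constants are placeholder strings rather than PyIceberg's real values, and falling back to the `file` scheme is an assumption that mirrors the FsSpec behavior described above, not necessarily what PyIceberg does.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Placeholder identifiers standing in for the real PyIceberg constants.
ARROW_FILE_IO = "ARROW_FILE_IO"
FSSPEC_FILE_IO = "FSSPEC_FILE_IO"

# Same mapping as quoted in the issue: implementations in preference order.
SCHEMA_TO_FILE_IO: Dict[str, List[str]] = {
    "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "gs": [ARROW_FILE_IO],
    "file": [ARROW_FILE_IO],
    "hdfs": [ARROW_FILE_IO],
    "abfs": [FSSPEC_FILE_IO],
    "abfss": [FSSPEC_FILE_IO],
}


def candidate_file_ios(location: str, default_scheme: str = "file") -> List[str]:
    """Hypothetical helper: list FileIO candidates for a location.

    A location without a scheme (e.g. /tmp/warehouse) is assumed to be local.
    """
    scheme = urlparse(location).scheme or default_scheme
    return SCHEMA_TO_FILE_IO.get(scheme, [])


print(candidate_file_ios("s3://bucket/warehouse"))
print(candidate_file_ios("/tmp/warehouse"))
print(candidate_file_ios("abfs://container/warehouse"))
```

With one FileIO implementation, the scheme-less fallback would only need to be decided and tested in one place.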