Consolidate FileIO #310

@kevinjqliu

Feature Request / Improvement

Can we consolidate and standardize FileIO to the PyArrow implementation?

There are currently two different FileIO implementations, ARROW_FILE_IO and FSSPEC_FILE_IO. ARROW_FILE_IO uses Apache Arrow's Filesystem Interface while FSSPEC_FILE_IO uses the fsspec library.

Here are a few reasons for consolidating:

  1. PyArrow is already preferred over FsSpec for various FS implementations.

    SCHEMA_TO_FILE_IO: Dict[str, List[str]] = {
        "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
        "s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
        "s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
        "gs": [ARROW_FILE_IO],
        "file": [ARROW_FILE_IO],
        "hdfs": [ARROW_FILE_IO],
        "abfs": [FSSPEC_FILE_IO],
        "abfss": [FSSPEC_FILE_IO],
    }

  2. PyIceberg is becoming more tightly coupled with PyArrow: to_arrow() and pa.Table are widely used for reading and writing, and the new create_table feature accepts a PyArrow Schema (#305).

  3. A single FileIO removes the need to keep the behavior of two implementations in sync. For example, FsSpec treats a path with no scheme (/tmp/warehouse) as using the file scheme, but PyArrow does not. See #301

  4. The two FileIO implementations are not that different from one another. FsSpec uses its underlying FS implementations, including LocalFileSystem, S3FileSystem, GCSFileSystem, and AzureBlobFileSystem,
    while PyArrow uses its own FS implementations, including LocalFileSystem, S3FileSystem, HadoopFileSystem, and GcsFileSystem.
    The gap on the PyArrow side is Azure: the mapping above has no ARROW_FILE_IO entry for abfs or abfss, although PyArrow does cover HDFS through HadoopFileSystem.

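  5. FsSpec and PyArrow filesystems can be used bidirectionally:
    PyArrow can use an fsspec-based filesystem (pyarrow.fs.PyFileSystem with FSSpecHandler).
    FsSpec can wrap a PyArrow filesystem (fsspec.implementations.arrow.ArrowFSWrapper).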

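For points 1 and 3, the following sketch shows how a scheme-to-FileIO lookup of this shape could behave, including the no-scheme case from #301. The helper `candidate_file_ios` and its `default_scheme` parameter are hypothetical, and the two constants are placeholder strings rather than the real import paths. This is not PyIceberg's actual resolution code.

```python
from typing import Dict, List, Optional
from urllib.parse import urlparse

# Placeholder values; stand-ins for the real constants (assumption).
ARROW_FILE_IO = "ARROW_FILE_IO"
FSSPEC_FILE_IO = "FSSPEC_FILE_IO"

SCHEMA_TO_FILE_IO: Dict[str, List[str]] = {
    "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "gs": [ARROW_FILE_IO],
    "file": [ARROW_FILE_IO],
    "hdfs": [ARROW_FILE_IO],
    "abfs": [FSSPEC_FILE_IO],
    "abfss": [FSSPEC_FILE_IO],
}


def candidate_file_ios(location: str, default_scheme: Optional[str] = None) -> List[str]:
    """Hypothetical helper: return FileIO candidates for a location, in preference order."""
    # urlparse yields an empty scheme for a bare path such as /tmp/warehouse.
    scheme = urlparse(location).scheme or default_scheme
    if not scheme:
        return []
    return SCHEMA_TO_FILE_IO.get(scheme, [])


print(candidate_file_ios("s3://bucket/warehouse"))
print(candidate_file_ios("/tmp/warehouse"))
print(candidate_file_ios("/tmp/warehouse", default_scheme="file"))
```

With two implementations, the choice of `default_scheme` has to be made, and kept identical, in both places. With one implementation it is made once.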