Comparing Feather vs Parquet

We decided to go with feather:

  • Feather and Parquet have comparable read/write speed

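  • With pandas defaults, both formats compress: to_parquet uses snappy (not gzip), and to_feather uses LZ4 when the installed pyarrow provides it, otherwise it writes uncompressed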

  • Parquet writes slightly faster with compression disabled but reads back more slowly, so the overall difference is small

  • The .feather file is the smallest: 184M, versus 201M for the default (compressed) .parquet and 407M for the uncompressed .parquet

  • When writing .feather, note:

    • Column names must be strings

    • Only a default index can be written; for any other index, call df.reset_index() first to move it into columns
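The two Feather constraints above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `prepare_for_feather` and the toy frame are hypothetical, and it assumes that "default index" means a 0..n-1 range.

```python
import pandas as pd


def prepare_for_feather(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with string column names and a default 0..n-1 index."""
    out = df.copy()
    # Feather requires string column names.
    out.columns = out.columns.astype(str)
    # Any index other than the default 0..n-1 range is moved into a column.
    if not out.index.equals(pd.RangeIndex(len(out))):
        out = out.reset_index()
    return out


# Toy frame (hypothetical data): integer column names and a labelled index.
example = pd.DataFrame({0: ["x", "y"], 1: ["u", "v"]}, index=["a", "b"])
prepared = prepare_for_feather(example)
print(list(prepared.columns))  # → ['index', '0', '1']
```

Checking the index before resetting avoids adding a redundant `index` column when the frame already has the default index.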

import pandas as pd

url = "https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz"
df = pd.read_csv(url, sep="\t", header=None, low_memory=False, compression="gzip")

df.columns = df.columns.astype(str)  # feather requires string column names
# feather can only write a default index; reset_index() moves the current index into a column
df = df.reset_index()

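By default, pandas writes Parquet with snappy compression (not gzip). Feather is written with LZ4 compression when the installed pyarrow supports it, and uncompressed otherwise.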

%%time

df.to_feather("human-uniprot.feather")

CPU times: user 887 ms, sys: 298 ms, total: 1.19 s
Wall time: 1.04 s

%%time

df = pd.read_feather("human-uniprot.feather")

CPU times: user 433 ms, sys: 204 ms, total: 637 ms
Wall time: 592 ms

%%time

df.to_parquet("human-uniprot.parquet")

CPU times: user 1.24 s, sys: 311 ms, total: 1.55 s
Wall time: 1.63 s

%%time

df = pd.read_parquet("human-uniprot.parquet")

CPU times: user 737 ms, sys: 455 ms, total: 1.19 s
Wall time: 1.04 s

df.to_parquet("human-uniprot-no-compr.parquet", compression=None)
df = pd.read_parquet("human-uniprot-no-compr.parquet")

! ls -lh (*.parquet|*.feather)

-rw-r--r--  1 sunnysun  staff   407M Jun 29 17:03 human-uniprot-no-compr.parquet
-rw-r--r--  1 sunnysun  staff   184M Jun 29 17:03 human-uniprot.feather
-rw-r--r--  1 sunnysun  staff   201M Jun 29 17:03 human-uniprot.parquet