Comparing Feather vs Parquet
We decided to go with Feather:
- Feather and Parquet have comparable read/write speed (benchmarked below; see also the sketch after this list).
- Parquet is compressed by default (Snappy when written through pandas), while Feather applies little to no compression.
- Parquet writes a bit faster with compression turned off, but reads back more slowly, so overall it makes little difference.
- The .feather file ends up a lot smaller, even smaller than the compressed .parquet.
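The timings further down use the notebook's %%time magic; the sketch below shows one way to script the same write/read comparison with time.perf_counter. The helper and the placeholder frame are hypothetical, only standing in for the UniProt table loaded later.

import time

import pandas as pd


def time_roundtrip(df: pd.DataFrame, path: str) -> None:
    """Write then read `path`, printing wall-clock times (feather or parquet, chosen by extension)."""
    write = df.to_feather if path.endswith(".feather") else df.to_parquet
    read = pd.read_feather if path.endswith(".feather") else pd.read_parquet
    start = time.perf_counter()
    write(path)
    print(f"write {path}: {time.perf_counter() - start:.2f}s")
    start = time.perf_counter()
    read(path)
    print(f"read  {path}: {time.perf_counter() - start:.2f}s")


# Placeholder data only; substitute the real DataFrame for a meaningful comparison.
demo = pd.DataFrame({"0": range(1_000_000), "1": ["x"] * 1_000_000})
time_roundtrip(demo, "demo.feather")
time_roundtrip(demo, "demo.parquet")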
When writing to .feather, note:
- The column names need to be strings.
- The pandas index won't be written; call df.reset_index() first.
import pandas as pd

url = "https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz"
df = pd.read_csv(url, sep="\t", header=None, low_memory=False, compression="gzip")
df.columns = df.columns.astype(str)  # feather requires string column names
# feather does not support serializing <class 'pandas.core.indexes.base.Index'> for the index
df = df.reset_index()
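If the index actually carries information (it doesn't here), a minimal sketch of one way to keep it across the round trip, using a made-up frame and file name:

import pandas as pd

# Hypothetical frame whose index is meaningful (UniProt-style accessions).
frame = pd.DataFrame({"value": [1, 2, 3]}, index=["P12345", "Q67890", "A0A024R1R8"])
frame.index.name = "accession"  # name the index so reset_index keeps it as a labelled column
frame.reset_index().to_feather("indexed.feather")
restored = pd.read_feather("indexed.feather").set_index("accession")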
By default Feather applies little to no compression, while pandas writes Parquet with Snappy compression.
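Either default can be overridden. A minimal sketch with placeholder file names (assumes pandas ≥ 1.1 and a pyarrow build that includes these codecs), separate from the timings below:

df.to_parquet("human-uniprot-gzip.parquet", compression="gzip")  # heavier codec: smaller file, slower write
df.to_feather("human-uniprot-zstd.feather", compression="zstd")  # keyword is forwarded to pyarrow.feather.write_feather
df.to_feather("human-uniprot-plain.feather", compression="uncompressed")  # force no compression at all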
%%time
df.to_feather("human-uniprot.feather")
CPU times: user 887 ms, sys: 298 ms, total: 1.19 s
Wall time: 1.04 s
%%time
df = pd.read_feather("human-uniprot.feather")
CPU times: user 433 ms, sys: 204 ms, total: 637 ms
Wall time: 592 ms
%%time
df.to_parquet("human-uniprot.parquet")
CPU times: user 1.24 s, sys: 311 ms, total: 1.55 s
Wall time: 1.63 s
%%time
df = pd.read_parquet("human-uniprot.parquet")
CPU times: user 737 ms, sys: 455 ms, total: 1.19 s
Wall time: 1.04 s
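As a sanity check that the two formats round-trip to the same data, a small sketch using pandas' testing helper (not part of the original timings):

from pandas.testing import assert_frame_equal

# Both files should decode to identical DataFrames; this raises an AssertionError on any mismatch.
assert_frame_equal(
    pd.read_feather("human-uniprot.feather"),
    pd.read_parquet("human-uniprot.parquet"),
)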
df.to_parquet("human-uniprot-no-compr.parquet", compression=None)
df = pd.read_parquet("human-uniprot-no-compr.parquet")
! ls -lh *.parquet *.feather
-rw-r--r-- 1 sunnysun staff 407M Jun 29 17:03 human-uniprot-no-compr.parquet
-rw-r--r-- 1 sunnysun staff 184M Jun 29 17:03 human-uniprot.feather
-rw-r--r-- 1 sunnysun staff 201M Jun 29 17:03 human-uniprot.parquet
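A shell-independent way to get the same size comparison, sketched with os and glob:

import glob
import os

# Report each written file's size in MiB, roughly matching the ls -lh output above.
for path in sorted(glob.glob("*.feather") + glob.glob("*.parquet")):
    print(f"{path}: {os.path.getsize(path) / 2**20:.0f} MiB")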