Enable support for Parquet for the AWS S3 target.
Enabling Parquet support would give data lake users a better solution. Apache Parquet is an open source file format designed for efficient, performant flat columnar storage of data, in contrast to row-based CSV files. Parquet handles complex data in bulk much better and more efficiently, and it features efficient data compression and encoding.
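As a rough illustration of why a columnar layout helps, the toy sketch below pivots row records into columns and run-length encodes one of them. It uses only the Python standard library and is not Parquet's actual on-disk format; the sample records and the helper names `to_columns` and `run_length_encode` are made up for illustration.

```python
from itertools import groupby

# Hypothetical sample records, as they would appear row by row in a CSV file.
rows = [
    {"region": "us-east-1", "status": "ok", "bytes": 120},
    {"region": "us-east-1", "status": "ok", "bytes": 340},
    {"region": "us-east-1", "status": "error", "bytes": 0},
    {"region": "eu-west-1", "status": "ok", "bytes": 75},
    {"region": "eu-west-1", "status": "ok", "bytes": 410},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {key: [record[key] for record in records] for key in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# A query that only needs "bytes" can read one column and skip the others.
total_bytes = sum(columns["bytes"])

# Repeated values sit next to each other in a column, so they compress well.
encoded_region = run_length_encode(columns["region"])
```

In a row-based file the same query would have to read every field of every record, which is the main reason the amount of data scanned drops so sharply with a columnar format.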
Below is a sample comparison between Parquet and CSV on AWS S3, showing the savings and speed-up from storing data as Parquet rather than CSV:
Dataset | AWS S3 Size | Query Time | Data Scanned | Cost |
Data stored as CSV | 1 TB | 250 seconds | 1.15 TB | $6 |
Data stored as Parquet | 130 GB | 8 seconds | 2.72 GB | $0.03 |
Savings | 87% less with Parquet | about 31 times faster | 99% less data scanned | 99% lower cost |
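For anyone who wants to check the savings row, the short sketch below recomputes it from the CSV and Parquet rows of the table. It assumes 1 TB = 1000 GB; the variable names and the `percent_reduction` helper are illustrative only.

```python
# Figures copied from the comparison table (assumption: 1 TB = 1000 GB).
csv_stats = {"size_gb": 1000.0, "query_s": 250.0, "scanned_gb": 1150.0, "cost_usd": 6.0}
parquet_stats = {"size_gb": 130.0, "query_s": 8.0, "scanned_gb": 2.72, "cost_usd": 0.03}


def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, in percent."""
    return (before - after) / before * 100.0


size_saving = percent_reduction(csv_stats["size_gb"], parquet_stats["size_gb"])
speedup = csv_stats["query_s"] / parquet_stats["query_s"]
scan_saving = percent_reduction(csv_stats["scanned_gb"], parquet_stats["scanned_gb"])
cost_saving = percent_reduction(csv_stats["cost_usd"], parquet_stats["cost_usd"])

print(f"Storage: {size_saving:.0f}% less")
print(f"Query time: {speedup:.2f}x faster")
print(f"Data scanned: {scan_saving:.1f}% less")
print(f"Cost: {cost_saving:.1f}% less")
```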