Parquet ファイルをサクッと読むときに parquet
コマンドを使うと便利❗️
Homebrew なら brew install parquet-cli
コマンドで簡単にセットアップできる.今回は最新 v1.12.3 を使う.そして parquet
コマンドの実装は GitHub の apache/parquet-mr リポジトリにある.parquet help
コマンドの出力だと引数まで細かく把握できず,実装を確認する必要がある.実装は parquet-cli/src/main/java/org/apache/parquet/cli/commands/
ディレクトリにある.
$ brew install parquet-cli $ brew info parquet-cli ==> parquet-cli: stable 1.12.3 (bottled), HEAD
あと関連記事を調べると parquet-tools
も紹介されていたりするけど,現時点では使えなくなっている🔥
$ brew install parquet-tools Error: parquet-tools has been disabled because it is deprecated upstream!
parquet
コマンド
parquet help
コマンドで確認できる通り,サブコマンドは多くある.今回は代表的なものを紹介する.
parquet meta
parquet schema
parquet head
parquet cat
parquet column-size
$ parquet help Usage: parquet [options] [command] [command options] Options: -v, --verbose, --debug Print extra debugging information Commands: help Retrieves details on the functions of other commands meta Print a Parquet file's metadata pages Print page summaries for a Parquet file dictionary Print dictionaries for a Parquet column check-stats Check Parquet files for corrupt page and column stats (PARQUET-251) schema Print the Avro schema for a file csv-schema Build a schema from a CSV data sample convert-csv Create a file from CSV data convert Create a Parquet file from a data file to-avro Create an Avro file from a data file cat Print the first N records from a file head Print the first N records from a file column-index Prints the column and offset indexes of a Parquet file column-size Print the column sizes of a parquet file prune Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed. trans-compression Translate the compression from one to another (It doesn't support bloom filter feature yet). masking Replace columns with masked values and write to a new Parquet file footer Print the Parquet file footer in json format Examples: # print information for meta parquet help meta See 'parquet help <command>' for more information on a specific command.
Parquet サンプル
今回は TLC Trip Record Data で公開されている「タクシーデータセット🚕」を使う.Parquet フォーマットで取得できる.
- 2022/01
- Yellow Taxi Trip Records
yellow_tripdata_2022-01.parquet
- 36 MB
🪓 parquet meta
コマンド
Parquet ファイルのメタデータを取得できる.一部は割愛して載せているけど,スキーマ情報や統計情報を確認できる.
$ parquet meta yellow_tripdata_2022-01.parquet File path: yellow_tripdata_2022-01.parquet Created by: parquet-cpp-arrow version 7.0.0 (中略) Schema: message schema { optional int64 VendorID; optional int64 tpep_pickup_datetime (TIMESTAMP(MICROS,false)); optional int64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false)); optional double passenger_count; optional double trip_distance; optional double RatecodeID; optional binary store_and_fwd_flag (STRING); optional int64 PULocationID; optional int64 DOLocationID; optional int64 payment_type; optional double fare_amount; optional double extra; optional double mta_tax; optional double tip_amount; optional double tolls_amount; optional double improvement_surcharge; optional double total_amount; optional double congestion_surcharge; optional double airport_fee; } Row group 0: count: 2463931 15.47 B records start: 4 total(compressed): 36.361 MB total(uncompressed):63.526 MB -------------------------------------------------------------------------------- type encodings count avg size nulls min / max VendorID INT64 G _ R 2463931 0.13 B 0 "1" / "6" tpep_pickup_datetime INT64 G _ R_ F 2463931 3.84 B 0 "2008-12-31T22:23:09.000000" / "2022-05-18T20:41:57.000000" tpep_dropoff_datetime INT64 G _ R_ F 2463931 3.92 B 0 "2008-12-31T23:06:56.000000" / "2022-05-18T20:47:45.000000" passenger_count DOUBLE G _ R 2463931 0.20 B 71503 "-0.0" / "9.0" trip_distance DOUBLE G _ R 2463931 1.39 B 0 "-0.0" / "306159.28" RatecodeID DOUBLE G _ R 2463931 0.06 B 71503 "1.0" / "99.0" store_and_fwd_flag BINARY G _ R 2463931 0.03 B 71503 "N" / "Y" PULocationID INT64 G _ R 2463931 0.78 B 0 "1" / "265" DOLocationID INT64 G _ R 2463931 0.99 B 0 "1" / "265" payment_type INT64 G _ R 2463931 0.15 B 0 "0" / "5" fare_amount DOUBLE G _ R 2463931 1.01 B 0 "-480.0" / "401092.32" extra DOUBLE G _ R 2463931 0.22 B 0 "-4.5" / "33.5" mta_tax DOUBLE G _ R 2463931 0.02 B 0 "-0.5" / "16.59" tip_amount DOUBLE G _ R 2463931 1.06 B 0 "-125.22" / "888.88" tolls_amount DOUBLE G _ R 2463931 0.09 B 0 "-31.4" / "193.3" improvement_surcharge DOUBLE G _ R 2463931 0.01 B 0 "-0.3" / "0.3" total_amount DOUBLE G _ R 2463931 1.42 B 0 "-480.3" / "401095.62" congestion_surcharge DOUBLE G _ R 2463931 0.09 B 71503 "-2.5" / "2.5" airport_fee DOUBLE G _ R 2463931 0.07 B 71503 "-1.25" / "1.25"
🪓 parquet schema
コマンド
Parquet ファイルの Avro スキーマ情報を取得できる.
$ parquet schema yellow_tripdata_2022-01.parquet { "type" : "record", "name" : "schema", "fields" : [ { "name" : "VendorID", "type" : [ "null", "long" ], "default" : null }, { "name" : "tpep_pickup_datetime", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null }, { "name" : "tpep_dropoff_datetime", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null }, { "name" : "passenger_count", "type" : [ "null", "double" ], "default" : null }, { "name" : "trip_distance", "type" : [ "null", "double" ], "default" : null }, { "name" : "RatecodeID", "type" : [ "null", "double" ], "default" : null }, { "name" : "store_and_fwd_flag", "type" : [ "null", "string" ], "default" : null }, { "name" : "PULocationID", "type" : [ "null", "long" ], "default" : null }, { "name" : "DOLocationID", "type" : [ "null", "long" ], "default" : null }, { "name" : "payment_type", "type" : [ "null", "long" ], "default" : null }, { "name" : "fare_amount", "type" : [ "null", "double" ], "default" : null }, { "name" : "extra", "type" : [ "null", "double" ], "default" : null }, { "name" : "mta_tax", "type" : [ "null", "double" ], "default" : null }, { "name" : "tip_amount", "type" : [ "null", "double" ], "default" : null }, { "name" : "tolls_amount", "type" : [ "null", "double" ], "default" : null }, { "name" : "improvement_surcharge", "type" : [ "null", "double" ], "default" : null }, { "name" : "total_amount", "type" : [ "null", "double" ], "default" : null }, { "name" : "congestion_surcharge", "type" : [ "null", "double" ], "default" : null }, { "name" : "airport_fee", "type" : [ "null", "double" ], "default" : null } ] }
🪓 parquet head
コマンド
Parquet ファイルのデータを取得できる(デフォルトは先頭10件).--num-records
オプションで件数を変えられる.
$ parquet head --num-records 2 yellow_tripdata_2022-01.parquet | jq . { "VendorID": 1, "tpep_pickup_datetime": 1640997340000000, "tpep_dropoff_datetime": 1640998409000000, "passenger_count": 2, "trip_distance": 3.8, "RatecodeID": 1, "store_and_fwd_flag": "N", "PULocationID": 142, "DOLocationID": 236, "payment_type": 1, "fare_amount": 14.5, "extra": 3, "mta_tax": 0.5, "tip_amount": 3.65, "tolls_amount": 0, "improvement_surcharge": 0.3, "total_amount": 21.95, "congestion_surcharge": 2.5, "airport_fee": 0 } { "VendorID": 1, "tpep_pickup_datetime": 1640997223000000, "tpep_dropoff_datetime": 1640997727000000, "passenger_count": 1, "trip_distance": 2.1, "RatecodeID": 1, "store_and_fwd_flag": "N", "PULocationID": 236, "DOLocationID": 42, "payment_type": 1, "fare_amount": 8, "extra": 0.5, "mta_tax": 0.5, "tip_amount": 4, "tolls_amount": 0, "improvement_surcharge": 0.3, "total_amount": 13.3, "congestion_surcharge": 0, "airport_fee": 0 }
🪓 parquet cat
コマンド
Parquet ファイルのデータを取得できる(デフォルトは全件).実装としては head
も cat
も同じ.--num-records
オプションで件数を変えられたり,--columns
オプションでカラムを指定することもできる.
$ parquet cat --num-records 2 yellow_tripdata_2022-01.parquet --columns VendorID,tpep_pickup_datetime,tpep_dropoff_datetime | jq . { "VendorID": 1, "tpep_pickup_datetime": 1640997340000000, "tpep_dropoff_datetime": 1640998409000000 } { "VendorID": 1, "tpep_pickup_datetime": 1640997223000000, "tpep_dropoff_datetime": 1640997727000000 }
🪓 parquet column-size
コマンド
Parquet ファイルのカラムごとのサイズを取得できる.--columns
オプションでカラムを指定することもできる.
$ parquet column-size yellow_tripdata_2022-01.parquet DOLocationID-> Size In Bytes: 2445215 Size In Ratio: 0.06413285 RatecodeID-> Size In Bytes: 144678 Size In Ratio: 0.0037945996 fare_amount-> Size In Bytes: 2492332 Size In Ratio: 0.06536862 tpep_dropoff_datetime-> Size In Bytes: 9646502 Size In Ratio: 0.25300744 congestion_surcharge-> Size In Bytes: 214074 Size In Ratio: 0.0056147105 VendorID-> Size In Bytes: 318048 Size In Ratio: 0.00834173 passenger_count-> Size In Bytes: 488835 Size In Ratio: 0.012821114 tolls_amount-> Size In Bytes: 226301 Size In Ratio: 0.005935399 improvement_surcharge-> Size In Bytes: 26125 Size In Ratio: 6.8520376E-4 trip_distance-> Size In Bytes: 3413238 Size In Ratio: 0.08952205 store_and_fwd_flag-> Size In Bytes: 74134 Size In Ratio: 0.0019443788 payment_type-> Size In Bytes: 377921 Size In Ratio: 0.009912073 total_amount-> Size In Bytes: 3506595 Size In Ratio: 0.09197061 extra-> Size In Bytes: 540290 Size In Ratio: 0.01417067 tip_amount-> Size In Bytes: 2623199 Size In Ratio: 0.06880099 mta_tax-> Size In Bytes: 55403 Size In Ratio: 0.0014531041 airport_fee-> Size In Bytes: 163205 Size In Ratio: 0.004280524 tpep_pickup_datetime-> Size In Bytes: 9459174 Size In Ratio: 0.24809423 PULocationID-> Size In Bytes: 1912073 Size In Ratio: 0.05014965
おまけ : VS Code 拡張機能 parquet-viewer
VS Code 拡張機能 parquet-viewer を使うと,Parquet フォーマットを JSON フォーマットに変換して表示できる(Parquet ファイルが大きすぎるとうまく表示されないこともある).