kakakakakku blog

Weekly Tech Blog: Keep on Learning!

parquet-cli で Parquet ファイルを読む

Parquet ファイルをサクッと読むときに parquet コマンドを使うと便利❗️

Homebrew なら brew install parquet-cli コマンドで簡単にセットアップできる.今回は最新 v1.12.3 を使う.そして parquet コマンドの実装は GitHub の apache/parquet-mr リポジトリにある.parquet help コマンドの出力だと引数まで細かく把握できず,実装を確認する必要がある.実装は parquet-cli/src/main/java/org/apache/parquet/cli/commands/ ディレクトリにある.

$ brew install parquet-cli

$ brew info parquet-cli
==> parquet-cli: stable 1.12.3 (bottled), HEAD

github.com

あと関連記事を調べると parquet-tools も紹介されていたりするけど,現時点では使えなくなっている🔥

$ brew install parquet-tools
Error: parquet-tools has been disabled because it is deprecated upstream!

formulae.brew.sh

parquet コマンド

parquet help コマンドで確認できる通り,サブコマンドは多くある.今回は代表的なものを紹介する.

  • parquet meta
  • parquet schema
  • parquet head
  • parquet cat
  • parquet column-size
$ parquet help

Usage: parquet [options] [command] [command options]

  Options:

    -v, --verbose, --debug
    Print extra debugging information

  Commands:

    help
    Retrieves details on the functions of other commands
    meta
    Print a Parquet file's metadata
    pages
  Print page summaries for a Parquet file
    dictionary
  Print dictionaries for a Parquet column
    check-stats
  Check Parquet files for corrupt page and column stats (PARQUET-251)
    schema
  Print the Avro schema for a file
    csv-schema
  Build a schema from a CSV data sample
    convert-csv
  Create a file from CSV data
    convert
  Create a Parquet file from a data file
    to-avro
  Create an Avro file from a data file
    cat
  Print the first N records from a file
    head
  Print the first N records from a file
    column-index
  Prints the column and offset indexes of a Parquet file
    column-size
  Print the column sizes of a parquet file
    prune
  Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.
    trans-compression
  Translate the compression from one to another (It doesn't support bloom filter feature yet).
    masking
    Replace columns with masked values and write to a new Parquet file
    footer
    Print the Parquet file footer in json format

  Examples:

    # print information for meta
    parquet help meta

  See 'parquet help <command>' for more information on a specific command.

Parquet サンプル

今回は TLC Trip Record Data で公開されている「タクシーデータセット🚕」を使う.Parquet フォーマットで取得できる.

  • 2022/01
  • Yellow Taxi Trip Records
  • yellow_tripdata_2022-01.parquet
    • 36 MB

www.nyc.gov

🪓 parquet meta コマンド

Parquet ファイルのメタデータを取得できる.一部は割愛して載せているけど,スキーマ情報や統計情報を確認できる.

$ parquet meta yellow_tripdata_2022-01.parquet

File path:  yellow_tripdata_2022-01.parquet
Created by: parquet-cpp-arrow version 7.0.0

(中略)        

Schema:
message schema {
  optional int64 VendorID;
  optional int64 tpep_pickup_datetime (TIMESTAMP(MICROS,false));
  optional int64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false));
  optional double passenger_count;
  optional double trip_distance;
  optional double RatecodeID;
  optional binary store_and_fwd_flag (STRING);
  optional int64 PULocationID;
  optional int64 DOLocationID;
  optional int64 payment_type;
  optional double fare_amount;
  optional double extra;
  optional double mta_tax;
  optional double tip_amount;
  optional double tolls_amount;
  optional double improvement_surcharge;
  optional double total_amount;
  optional double congestion_surcharge;
  optional double airport_fee;
}


Row group 0:  count: 2463931  15.47 B records  start: 4  total(compressed): 36.361 MB total(uncompressed):63.526 MB
--------------------------------------------------------------------------------
                       type      encodings count     avg size   nulls   min / max
VendorID               INT64     G _ R     2463931   0.13 B     0       "1" / "6"
tpep_pickup_datetime   INT64     G _ R_ F  2463931   3.84 B     0       "2008-12-31T22:23:09.000000" / "2022-05-18T20:41:57.000000"
tpep_dropoff_datetime  INT64     G _ R_ F  2463931   3.92 B     0       "2008-12-31T23:06:56.000000" / "2022-05-18T20:47:45.000000"
passenger_count        DOUBLE    G _ R     2463931   0.20 B     71503   "-0.0" / "9.0"
trip_distance          DOUBLE    G _ R     2463931   1.39 B     0       "-0.0" / "306159.28"
RatecodeID             DOUBLE    G _ R     2463931   0.06 B     71503   "1.0" / "99.0"
store_and_fwd_flag     BINARY    G _ R     2463931   0.03 B     71503   "N" / "Y"
PULocationID           INT64     G _ R     2463931   0.78 B     0       "1" / "265"
DOLocationID           INT64     G _ R     2463931   0.99 B     0       "1" / "265"
payment_type           INT64     G _ R     2463931   0.15 B     0       "0" / "5"
fare_amount            DOUBLE    G _ R     2463931   1.01 B     0       "-480.0" / "401092.32"
extra                  DOUBLE    G _ R     2463931   0.22 B     0       "-4.5" / "33.5"
mta_tax                DOUBLE    G _ R     2463931   0.02 B     0       "-0.5" / "16.59"
tip_amount             DOUBLE    G _ R     2463931   1.06 B     0       "-125.22" / "888.88"
tolls_amount           DOUBLE    G _ R     2463931   0.09 B     0       "-31.4" / "193.3"
improvement_surcharge  DOUBLE    G _ R     2463931   0.01 B     0       "-0.3" / "0.3"
total_amount           DOUBLE    G _ R     2463931   1.42 B     0       "-480.3" / "401095.62"
congestion_surcharge   DOUBLE    G _ R     2463931   0.09 B     71503   "-2.5" / "2.5"
airport_fee            DOUBLE    G _ R     2463931   0.07 B     71503   "-1.25" / "1.25"

parquet.apache.org

🪓 parquet schema コマンド

Parquet ファイルの Avro スキーマ情報を取得できる.

$ parquet schema yellow_tripdata_2022-01.parquet
{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "VendorID",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "tpep_pickup_datetime",
    "type" : [ "null", {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    } ],
    "default" : null
  }, {
    "name" : "tpep_dropoff_datetime",
    "type" : [ "null", {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    } ],
    "default" : null
  }, {
    "name" : "passenger_count",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "trip_distance",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "RatecodeID",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "store_and_fwd_flag",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "PULocationID",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "DOLocationID",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "payment_type",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "fare_amount",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "extra",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "mta_tax",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "tip_amount",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "tolls_amount",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "improvement_surcharge",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "total_amount",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "congestion_surcharge",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "airport_fee",
    "type" : [ "null", "double" ],
    "default" : null
  } ]
}

🪓 parquet head コマンド

Parquet ファイルのデータを取得できる(デフォルトは先頭10件).--num-records オプションで件数を変えられる.

$ parquet head --num-records 2 yellow_tripdata_2022-01.parquet | jq .
{
  "VendorID": 1,
  "tpep_pickup_datetime": 1640997340000000,
  "tpep_dropoff_datetime": 1640998409000000,
  "passenger_count": 2,
  "trip_distance": 3.8,
  "RatecodeID": 1,
  "store_and_fwd_flag": "N",
  "PULocationID": 142,
  "DOLocationID": 236,
  "payment_type": 1,
  "fare_amount": 14.5,
  "extra": 3,
  "mta_tax": 0.5,
  "tip_amount": 3.65,
  "tolls_amount": 0,
  "improvement_surcharge": 0.3,
  "total_amount": 21.95,
  "congestion_surcharge": 2.5,
  "airport_fee": 0
}
{
  "VendorID": 1,
  "tpep_pickup_datetime": 1640997223000000,
  "tpep_dropoff_datetime": 1640997727000000,
  "passenger_count": 1,
  "trip_distance": 2.1,
  "RatecodeID": 1,
  "store_and_fwd_flag": "N",
  "PULocationID": 236,
  "DOLocationID": 42,
  "payment_type": 1,
  "fare_amount": 8,
  "extra": 0.5,
  "mta_tax": 0.5,
  "tip_amount": 4,
  "tolls_amount": 0,
  "improvement_surcharge": 0.3,
  "total_amount": 13.3,
  "congestion_surcharge": 0,
  "airport_fee": 0
}

🪓 parquet cat コマンド

Parquet ファイルのデータを取得できる(デフォルトは全件).実装としては headcat も同じ.--num-records オプションで件数を変えられたり,--columns オプションでカラムを指定することもできる.

$ parquet cat --num-records 2 yellow_tripdata_2022-01.parquet --columns VendorID,tpep_pickup_datetime,tpep_dropoff_datetime | jq .
{
  "VendorID": 1,
  "tpep_pickup_datetime": 1640997340000000,
  "tpep_dropoff_datetime": 1640998409000000
}
{
  "VendorID": 1,
  "tpep_pickup_datetime": 1640997223000000,
  "tpep_dropoff_datetime": 1640997727000000
}

🪓 parquet column-size コマンド

Parquet ファイルのカラムごとのサイズを取得できる.--columns オプションでカラムを指定することもできる.

$ parquet column-size yellow_tripdata_2022-01.parquet
DOLocationID-> Size In Bytes: 2445215 Size In Ratio: 0.06413285
RatecodeID-> Size In Bytes: 144678 Size In Ratio: 0.0037945996
fare_amount-> Size In Bytes: 2492332 Size In Ratio: 0.06536862
tpep_dropoff_datetime-> Size In Bytes: 9646502 Size In Ratio: 0.25300744
congestion_surcharge-> Size In Bytes: 214074 Size In Ratio: 0.0056147105
VendorID-> Size In Bytes: 318048 Size In Ratio: 0.00834173
passenger_count-> Size In Bytes: 488835 Size In Ratio: 0.012821114
tolls_amount-> Size In Bytes: 226301 Size In Ratio: 0.005935399
improvement_surcharge-> Size In Bytes: 26125 Size In Ratio: 6.8520376E-4
trip_distance-> Size In Bytes: 3413238 Size In Ratio: 0.08952205
store_and_fwd_flag-> Size In Bytes: 74134 Size In Ratio: 0.0019443788
payment_type-> Size In Bytes: 377921 Size In Ratio: 0.009912073
total_amount-> Size In Bytes: 3506595 Size In Ratio: 0.09197061
extra-> Size In Bytes: 540290 Size In Ratio: 0.01417067
tip_amount-> Size In Bytes: 2623199 Size In Ratio: 0.06880099
mta_tax-> Size In Bytes: 55403 Size In Ratio: 0.0014531041
airport_fee-> Size In Bytes: 163205 Size In Ratio: 0.004280524
tpep_pickup_datetime-> Size In Bytes: 9459174 Size In Ratio: 0.24809423
PULocationID-> Size In Bytes: 1912073 Size In Ratio: 0.05014965

おまけ : VS Code 拡張機能 parquet-viewer

VS Code 拡張機能 parquet-viewer を使うと,Parquet フォーマットを JSON フォーマットに変換して表示できる(Parquet ファイルが大きすぎるとうまく表示されないこともある).

marketplace.visualstudio.com