SQL で JSON ファイルを検索したいなー!でも大袈裟な分析環境は構築したくないなー!って思うときってあるよね❓きっとあるはず〜あるある!今回はお手軽に JSON ファイルを SQL で検索できる「Columnq」を紹介する❗️Columnq は JSON 以外に CSV や Parquet もサポートしている.
インストール
pip で簡単にインストールできる.今回は Columnq 0.3.0 を使う.
$ pip install columnq-cli $ columnq --version Columnq 0.3.0
データセット
今回は JSON ファイルのサンプルデータセットとして,GitHub のエクスポートデータを使う.詳しくは以下の記事を参照してもらえればと!
Columnq : 基本操作
🔎 SELECT ① : WHERE
columnq
コマンドでは --table
オプションに JSON ファイルを指定すると,SQL でテーブル名として使えるようになる.以下の SQL は GitHub のリポジトリ情報を含んだ repositories_000004.json
に対して SELECT
文を実行している.
$ columnq sql --table repositories_000004.json \ 'SELECT name, owner, default_branch, has_issues, general_settings.projects AS projects, created_at FROM repositories_000004 WHERE private = false AND is_archived = false LIMIT 5' +--------------------------------------------+--------------------------------+----------------+------------+----------+----------------------+ | name | owner | default_branch | has_issues | projects | created_at | +--------------------------------------------+--------------------------------+----------------+------------+----------+----------------------+ | Kubernetes-and-Cloud-Native-Associate-KCNA | https://github.com/kakakakakku | main | false | true | 2022-05-31T05:04:38Z | | pluto | https://github.com/kakakakakku | master | false | true | 2022-07-18T14:26:25Z | | killercoda-kakakakakku | https://github.com/kakakakakku | master | true | false | 2022-07-27T08:31:50Z | | sandbox-playwright-pytest-tracing | https://github.com/kakakakakku | master | false | true | 2022-11-30T04:09:18Z | +--------------------------------------------+--------------------------------+----------------+------------+----------+----------------------+
🔎 SELECT ② : GROUP BY
以下の SQL は GitHub のリポジトリ情報を含んだ repositories_000003.json
に対して SELECT
文を実行している.GROUP BY
もちゃんと使える.
$ columnq sql --table repositories_000003.json \ 'SELECT private, COUNT(*) AS count FROM repositories_000003 GROUP BY private' +---------+-------+ | private | count | +---------+-------+ | true | 69 | | false | 31 | +---------+-------+
🔎 SELECT ③ : JOIN
以下の SQL は GitHub のリポジトリ情報を含んだ repositories_000003.json
と issue 情報を含んだ issues_000001.json
に対して SELECT
文を実行している.複数の JSON ファイルに JOIN
を使えるのは本当に便利❗️
$ columnq sql --table repositories_000003.json --table issues_000001.json \ 'SELECT name, issues_000001.url, private FROM repositories_000003 JOIN issues_000001 ON repositories_000003.url = issues_000001.repository' +----------------------+--------------------------------------------------------------+---------+ | name | url | private | +----------------------+--------------------------------------------------------------+---------+ | tmp | https://github.com/kakakakakku/tmp/issues/1 | true | | TIL | https://github.com/kakakakakku/TIL/issues/1 | true | | TIL | https://github.com/kakakakakku/TIL/issues/2 | true | | TIL | https://github.com/kakakakakku/TIL/issues/3 | true | | TIL | https://github.com/kakakakakku/TIL/issues/4 | true | | TIL | https://github.com/kakakakakku/TIL/issues/5 | true | | TIL | https://github.com/kakakakakku/TIL/issues/6 | true | | TIL | https://github.com/kakakakakku/TIL/issues/7 | true | | TIL | https://github.com/kakakakakku/TIL/issues/8 | true | | hello-github-actions | https://github.com/kakakakakku/hello-github-actions/issues/1 | true | | hello-github-actions | https://github.com/kakakakakku/hello-github-actions/issues/3 | true | +----------------------+--------------------------------------------------------------+---------+
Columnq : 複数ファイルを横断して検索する
columnq
コマンドでは --table
オプションに JSON ファイルではなく「標準入力 (stdin)」を指定することもできる.例えば GitHub のリポジトリ情報を含んだ JSON ファイルが repositories_000001.json
から repositories_000004.json
に分割されている場合は cat
コマンドや jq
コマンドを使ってファイル結合をして,標準入力 (stdin) に対して SELECT
文を実行できる.
$ jq -s add repositories_*.json | \ columnq sql --table 't=stdin,format=json' \ 'SELECT private, COUNT(*) AS count FROM t GROUP BY private' +---------+-------+ | private | count | +---------+-------+ | true | 114 | | false | 199 | +---------+-------+
Columnq : インタラクティブモード
columnq console
コマンドを使うとインタラクティブに SQL を実行できる.
$ columnq console --table repositories_000004.json columnq(sql)> SHOW TABLES; +---------------+--------------------+---------------------+------------+ | table_catalog | table_schema | table_name | table_type | +---------------+--------------------+---------------------+------------+ | datafusion | public | repositories_000004 | BASE TABLE | | datafusion | information_schema | tables | VIEW | | datafusion | information_schema | views | VIEW | | datafusion | information_schema | columns | VIEW | +---------------+--------------------+---------------------+------------+ columnq(sql)> SHOW COLUMNS FROM repositories_000004; +---------------+--------------+---------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+ | table_catalog | table_schema | table_name | column_name | data_type | is_nullable | +---------------+--------------+---------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+ | datafusion | public | repositories_000004 | actions_general_settings | Struct([Field { name: "actions_disabled", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "allows_all_actions", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "allows_github_owned_actions", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "allows_local_actions_only", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "allows_specific_actions_patterns", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "allows_verified_actions", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "patterns", data_type: List(Field { name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }]) | YES | | datafusion | public | repositories_000004 | autolinks | List(Field { name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) | YES | | datafusion | public | repositories_000004 | collaborators | List(Field { name: "item", data_type: Struct([Field { name: "permission", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "user", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) | YES | | datafusion | public | repositories_000004 | created_at | Utf8 | YES | | datafusion | public | repositories_000004 | default_branch | Utf8 | YES | | datafusion | public | repositories_000004 | description | Utf8 | YES | | datafusion | public | repositories_000004 | general_settings | Struct([Field { name: "allow_forking", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "auto_merge", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "delete_branch_heads", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "discussions", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "git_lfs_in_archives", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "merge_commit", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "projects", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "rebase_merge", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "sponsorships", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "squash_merge", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "template", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "update_branch", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }]) | YES | | datafusion | public | repositories_000004 | git_url | Utf8 | YES | | datafusion | public | repositories_000004 | has_downloads | Boolean | YES | | datafusion | public | repositories_000004 | has_issues | Boolean | YES | | datafusion | public | repositories_000004 | has_wiki | Boolean | YES | | datafusion | public | repositories_000004 | is_archived | Boolean | YES | | datafusion | public | repositories_000004 | labels | List(Field { name: "item", data_type: Struct([Field { name: "color", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "created_at", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "description", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "name", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "url", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) | YES | | datafusion | public | repositories_000004 | name | Utf8 | YES | | datafusion | public | repositories_000004 | owner | Utf8 | YES | | datafusion | public | repositories_000004 | private | Boolean | YES | | datafusion | public | repositories_000004 | public_keys | List(Field { name: "item", data_type: Struct([Field { name: "created_at", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "fingerprint", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "key", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "read_only", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "title", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) | YES | | datafusion | public | repositories_000004 | repository_topics | List(Field { name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) | YES | | datafusion | public | repositories_000004 | security_and_analysis | Struct([Field { name: "advanced_security", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "dependency_graph", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "token_scanning", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "token_scanning_push_protection", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "vulnerability_alerts", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "vulnerability_updates", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }]) | YES | | datafusion | public | repositories_000004 | type | Utf8 | YES | | datafusion | public | repositories_000004 | url | Utf8 | YES | | datafusion | public | repositories_000004 | webhooks | List(Field { name: "item", data_type: Struct([Field { name: "active", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "content_type", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "enable_ssl_verification", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "event_types", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "payload_url", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) | YES | | datafusion | public | repositories_000004 | website | Utf8 | YES | +---------------+--------------+---------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+ columnq(sql)> SELECT name, owner, default_branch, has_issues, general_settings.projects AS projects, created_at FROM repositories_000004 WHERE private = false AND is_archived = false LIMIT 5; +--------------------------------------------+--------------------------------+----------------+------------+----------+----------------------+ | name | owner | default_branch | has_issues | projects | created_at | +--------------------------------------------+--------------------------------+----------------+------------+----------+----------------------+ | Kubernetes-and-Cloud-Native-Associate-KCNA | https://github.com/kakakakakku | main | false | true | 2022-05-31T05:04:38Z | | pluto | https://github.com/kakakakakku | master | false | true | 2022-07-18T14:26:25Z | | killercoda-kakakakakku | https://github.com/kakakakakku | master | true | false | 2022-07-27T08:31:50Z | | sandbox-playwright-pytest-tracing | https://github.com/kakakakakku | master | false | true | 2022-11-30T04:09:18Z | +--------------------------------------------+--------------------------------+----------------+------------+----------+----------------------+
まとめ
SQL で JSON ファイルをお手軽に検索するなら Columnq が便利❗️