In the previous article, I tried converting categorical variables (one-hot encoding) with the Pandas get_dummies() function.

Besides Pandas, scikit-learn's sklearn.preprocessing module is another option. This time I try the OneHotEncoder class from that module, and also the LabelEncoder class, which performs the related label encoding.

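Dataset 🔬

As in the previous article, I use the gapminder.tsv dataset from the GitHub repository chendaniely/pandas_for_everyone. To make the results easier to follow, the DataFrame is restricted to the continent column. The values to check are highlighted in red and yellow in the screenshots 🖍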

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Load the TSV and keep only the continent column (as a one-column DataFrame)
gapminder = pd.read_csv('./gapminder.tsv', delimiter='\t').loc[:, ['continent']]
gapminder

[Image: f:id:kakku22:20210501001144p:plain]

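`OneHotEncoder` class 🔬

Following the documentation, I try the OneHotEncoder class. First, calling the fit() method on the DataFrame makes the categories available through the categories_ attribute. Next, the transform() method returns the encoded matrix (a sparse matrix by default, so toarray() is used here to display it as a dense array). The columns follow the order of categories_: the first rows shown have a 1 in the Asia column and the last rows have a 1 in the Africa column, so one-hot encoding works as expected. In addition, the get_feature_names_out() method (get_feature_names() in scikit-learn versions before 1.0) returns column names like those produced by the Pandas get_dummies() function.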

oh_encoder = OneHotEncoder()
oh_encoder.fit(gapminder)
oh_encoder.categories_
# [array(['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'], dtype=object)]

oh_encoder.transform(gapminder).toarray()
# array([[0., 0., 1., 0., 0.],
#        [0., 0., 1., 0., 0.],
#        [0., 0., 1., 0., 0.],
#        ...,
#        [1., 0., 0., 0., 0.],
#        [1., 0., 0., 0., 0.],
#        [1., 0., 0., 0., 0.]])

# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2
oh_encoder.get_feature_names_out(['continent'])
# array(['continent_Africa', 'continent_Americas', 'continent_Asia', 'continent_Europe', 'continent_Oceania'], dtype=object)

[Image: f:id:kakku22:20210501001911p:plain]

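`LabelEncoder` class 🔬

Next I try the LabelEncoder class, which performs label encoding. Label encoding replaces each value with its integer code directly, so unlike one-hot encoding it does not add a column per value, and the encoded values are not limited to 0 and 1.

As before, calling the fit() method makes the categories available through the classes_ attribute. The array indices show that Africa = 0, Americas = 1, Asia = 2, and so on. Calling the transform() method then returns an array like [2, 2, 2, ..., 0, 0, 0], in which the leading 2s are Asia and the trailing 0s are Africa.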

l_encoder = LabelEncoder()
l_encoder.fit(gapminder.continent)
l_encoder.classes_
# array(['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'], dtype=object)

l_encoder.transform(gapminder.continent)
# array([2, 2, 2, ..., 0, 0, 0])

[Image: f:id:kakku22:20210501002316p:plain]
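The relationship between classes_ and the encoded integers can also be checked in code. The following is a minimal sketch that builds the label-to-code mapping from classes_ and restores the original values with inverse_transform(). The four-element Series is again a hypothetical stand-in for the gapminder continent column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for gapminder.continent (not the real data).
continent = pd.Series(['Asia', 'Asia', 'Europe', 'Africa'], name='continent')

l_encoder = LabelEncoder()
codes = l_encoder.fit_transform(continent)

# classes_ is sorted, and each label's position in it is its integer code.
mapping = {label: code for code, label in enumerate(l_encoder.classes_)}

# inverse_transform() converts the integer codes back to the original labels.
restored = l_encoder.inverse_transform(codes)

print(codes.tolist())     # [1, 1, 2, 0]
print(mapping)            # {'Africa': 0, 'Asia': 1, 'Europe': 2}
print(restored.tolist())  # ['Asia', 'Asia', 'Europe', 'Africa']
```

Note that the scikit-learn documentation describes LabelEncoder as intended for target labels (y). For input features, OrdinalEncoder is the counterpart in the same module.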

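Summary 🔬

Following up on the Pandas get_dummies() function, this time I used scikit-learn's sklearn.preprocessing module to try the OneHotEncoder class (one-hot encoding) and the LabelEncoder class (label encoding).

scikit-learn has many more features, so I'll keep learning one step at a time 💡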