Typing out the code from Python for Data Analysis (Pythonによるデータ分析入門), part 2

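I'm working through Python for Data Analysis (Pythonによるデータ分析入門) by typing out its code myself. Python is really convenient. It's starting to make me want to switch from R to Python. So far I've reached p. 25. Slow going...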


if in list comprehensions

A list comprehension can take an if condition.

The example below extracts only the records that contain the key 'tz'.

  In [73]: time_zone = [rec['tz'] for rec in records if 'tz' in rec]

  In [74]: time_zone[:10]
  Out[74]:
  ['America/New_York',
   'America/Denver',
   'America/New_York',
   'America/Sao_Paulo',
   'America/New_York',
   'America/New_York',
   'Europe/Warsaw',
   '',
   '',
   '']
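
To see the filter on its own, here is a minimal sketch using a small made-up list (`sample_records` is hypothetical data, not the book's dataset). A record without the 'tz' key is skipped, while a record whose 'tz' is an empty string is kept, which would explain the '' entries in the output above.

```python
# Hypothetical sample data standing in for the real `records` list.
sample_records = [
    {'tz': 'America/New_York', 'a': 'Mozilla/5.0'},
    {'a': 'GoogleMaps/RochesterNY'},   # no 'tz' key: filtered out
    {'tz': ''},                        # key exists with an empty value: kept
    {'tz': 'Asia/Tokyo'},
]

# Keep rec['tz'] only for records that actually have the key.
sample_time_zones = [rec['tz'] for rec in sample_records if 'tz' in rec]
print(sample_time_zones)
```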

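Tallying keys into counts with defaultdict

My understanding is that this resembles the reduce step of map/reduce.
Passing the int function to defaultdict makes missing keys start at 0 (int() returns 0), which makes counting easy to implement.

Reference: http://docs.python.jp/3.3/library/collections.html#collections.defaultdict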

  from collections import defaultdict
  def get_counts(sequence):
      counts = defaultdict(int)
      for x in sequence:
          counts[x] += 1
      return counts

  In [69]: counts = get_counts(time_zone)
  In [75]: counts['America/New_York']
  Out[75]: 1251
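
To convince myself why `defaultdict(int)` helps, here is a small sketch that counts the same made-up list with a plain dict and with a defaultdict. The function names and the `sample` list are my own, chosen only for illustration.

```python
from collections import defaultdict

# int() with no arguments returns 0, so it can serve as the default factory.
assert int() == 0

def count_with_dict(sequence):
    # Plain dict: the first occurrence of each key needs special handling.
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

def count_with_defaultdict(sequence):
    # defaultdict(int): a missing key is created with value 0 on first access.
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

sample = ['a', 'b', 'a', '', 'a']
plain = count_with_dict(sample)
dd = count_with_defaultdict(sample)
print(plain, dict(dd))
```

One thing to keep in mind: merely reading a missing key from a defaultdict (for example `dd['missing']`) inserts that key with value 0.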

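Extracting the top 10 from the above (hand-written sort)

defaultdict.items() yields the key-value pairs.

list.sort() sorts in ascending order.

Not used here, but specifying key lets you apply a function to each element before it is compared. Ex: key=str.lower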

  In [76]: def top_counts(count_dict,n):
     ....:     value_key_pairs = [(count,tz) for tz ,count in count_dict.items()]
     ....:     value_key_pairs.sort()
     ....:     return value_key_pairs[-n:]
     ....:

  In [78]: top_counts(counts,10)
  Out[78]:
  [(33, 'America/Sao_Paulo'),
   (35, 'Europe/Madrid'),
   (36, 'Pacific/Honolulu'),
   (37, 'Asia/Tokyo'),
   (74, 'Europe/London'),
   (191, 'America/Denver'),
   (382, 'America/Los_Angeles'),
   (400, 'America/Chicago'),
   (521, ''),
   (1251, 'America/New_York')]
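
Since the key argument is not used above, here is a small sketch of what it could look like. `top_counts_by_key` and `sample_counts` are names I made up; this is an alternative to swapping the pairs into (count, tz) tuples, not the book's code.

```python
# Hypothetical counts, small enough to check by eye.
sample_counts = {'America/New_York': 3, 'asia/Tokyo': 1, 'Europe/London': 2}

def top_counts_by_key(count_dict, n):
    # Sort the (key, count) pairs by count, largest first, without swapping them.
    pairs = sorted(count_dict.items(), key=lambda kv: kv[1], reverse=True)
    return pairs[:n]

top2 = top_counts_by_key(sample_counts, 2)

# key=str.lower lowercases each string before comparing (case-insensitive sort).
names_ci = sorted(sample_counts, key=str.lower)
# Without key, uppercase letters sort before lowercase ones.
names_default = sorted(sample_counts)

print(top2)
print(names_ci)
print(names_default)
```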

Extracting the top 10 from the above (using collections.Counter)

collections.Counter counts hashable objects
- As long as the elements are hashable, you can also do the following
```
  In [87]: Counter('gallahad')
  Out[87]: Counter({'a': 3, 'l': 2, 'd': 1, 'h': 1, 'g': 1})
```
Counter.most_common returns the n most frequent items with their counts
- http://docs.python.jp/3.3/library/collections.html#collections.Counter.most_common
```
  In [84]: from collections import Counter
  In [85]: counts =  Counter(time_zone)
  In [86]: counts.most_common(10)
```
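
The session above does not show the result, so here is a minimal sketch on a made-up list (`sample_tz` is hypothetical). Note that most_common returns (key, count) pairs in descending order, the reverse of the (count, key) ascending pairs from the hand-written top_counts.

```python
from collections import Counter

# Hypothetical time zone values, including empty strings.
sample_tz = ['America/New_York', '', 'America/New_York',
             'Asia/Tokyo', '', 'America/New_York']

sample_counter = Counter(sample_tz)

# The two most frequent values as (key, count) pairs, most frequent first.
top2 = sample_counter.most_common(2)
print(top2)
```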

Using pandas

Reference: データ分析ライブラリPandasの使い方 (How to use the data analysis library Pandas), by @iktakahiro

It is something like R's data frame type.

  from pandas import DataFrame,Series
  import pandas as pd
  import numpy as np
  frame = DataFrame(records)
  frame


  _heartbeat_                                                  a  \
  0           NaN  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
  1           NaN                             GoogleMaps/RochesterNY
  2           NaN  Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...
  3           NaN  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...
  (omitted)

Data access in pandas

You can access data with dataframe['label'], dataframe['label'][row number], and so on.

      In [102]: frame['tz'][:10]
      Out[102]:
      0     America/New_York
      1       America/Denver
      2     America/New_York
      3    America/Sao_Paulo
      4     America/New_York
      5     America/New_York
      6        Europe/Warsaw
      7
      8
      9
      Name: tz, dtype: object
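
A minimal sketch of both access patterns on a tiny made-up frame (`sample_frame` is hypothetical). I also include `.loc`, which, as far as I understand, is the more explicit way to read a single cell by row label and column label.

```python
import pandas as pd

# Hypothetical two-row frame; the default row labels are 0 and 1.
sample_frame = pd.DataFrame([
    {'tz': 'America/New_York', 'a': 'Mozilla/5.0'},
    {'tz': 'Asia/Tokyo', 'a': 'GoogleMaps/RochesterNY'},
])

tz_column = sample_frame['tz']            # one column, returned as a Series
first_tz = sample_frame['tz'][0]          # the element with row label 0
first_tz_loc = sample_frame.loc[0, 'tz']  # same cell via .loc[row, column]

print(type(tz_column).__name__, first_tz, first_tz_loc)
```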

Selecting ['label name'] on a DataFrame gets a single column.

The retrieved data is returned as a Series object.

Series objects have a method called value_counts().

Documentation for the Series object's value_counts():

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html

      In [117]: tz_counts = frame['tz'].value_counts()

      In [118]: tz_counts
      Out[118]:
      America/New_York       1251
                              521
      America/Chicago         400
      America/Los_Angeles     382
      America/Denver          191
      Europe/London            74
      Asia/Tokyo               37
      Pacific/Honolulu         36
      Europe/Madrid            35
      America/Sao_Paulo        33
      Europe/Berlin            28
      Europe/Rome              27
      America/Rainy_River      25
      Europe/Amsterdam         22
      America/Phoenix          20
      ...
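
Here is a small sketch of value_counts() on a made-up Series (`sample_tz` is hypothetical). Two behaviours worth noting: the result is sorted by count in descending order, and missing values are left out by default, while empty strings are counted like any other value.

```python
import pandas as pd

# Hypothetical values: one missing entry (None) and two empty strings.
sample_tz = pd.Series(['America/New_York', '', None,
                       'America/New_York', 'Asia/Tokyo', '',
                       'America/New_York'])

# Counts per distinct value, most frequent first; None/NaN is excluded by default.
sample_tz_counts = sample_tz.value_counts()
print(sample_tz_counts)
```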

Data cleansing

Replacing NA/NaN

Use the pandas.DataFrame.fillna() method to replace NA/NaN values (Series has the same method, which is what is used below)
- e.g. pandas.DataFrame.fillna(hoge.mean())
- http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.fillna.html
```
  In [119]: clean_tz = frame['tz'].fillna('MissingData')
```

Replacing empty strings

  clean_tz[clean_tz == ''] = 'Unknown'
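
The two cleaning steps together, as a minimal sketch on a made-up Series (`sample_tz` and `clean` are hypothetical names). fillna returns a new Series, so the original keeps its NaN; the second step assigns through a boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one NaN (missing) and one empty string.
sample_tz = pd.Series(['America/New_York', np.nan, '', 'Asia/Tokyo'])

# Step 1: replace missing values; this returns a new Series.
clean = sample_tz.fillna('MissingData')

# Step 2: overwrite the elements where the mask (clean == '') is True.
clean[clean == ''] = 'Unknown'

print(clean.tolist())
```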

Plotting with matplotlib

Plot with pandas.DataFrame.plot() (depends on matplotlib)

http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.plot.html

  In [127]: tz_count = clean_tz.value_counts()

  In [128]: tz_count[:10]
  Out[128]:
  America/New_York       1251
  Unknown                 521
  America/Chicago         400
  America/Los_Angeles     382
  America/Denver          191
  MissingData             120
  Europe/London            74
  Asia/Tokyo               37
  Pacific/Honolulu         36
  Europe/Madrid            35
  dtype: int64

  In [129]: tz_count[:10].plot(kind = 'barh',rot =0)
  Out[129]: <matplotlib.axes.AxesSubplot at 0x10945ba50>

[Figure: horizontal bar chart of the top 10 time zone counts]

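Data cleansing 2

List comprehensions and regular expressions

Extracting the records without missing values

dataframe.dropna()

Returns all of the data with the missing values removed.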

http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.dropna.html

  In [131]: results = Series([x.split()[0] for x in frame.a.dropna()])

  In [132]: results[:5]
  Out[132]:
  0               Mozilla/5.0
  1    GoogleMaps/RochesterNY
  2               Mozilla/4.0
  3               Mozilla/5.0
  4               Mozilla/5.0
  dtype: object
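
A small sketch of why dropna() comes first here, using a made-up Series (`agents` is hypothetical). A NaN is a float, so calling split() on it would raise an AttributeError; dropping the missing values beforehand avoids that.

```python
import numpy as np
import pandas as pd

# Hypothetical user-agent strings with one missing value.
agents = pd.Series(['Mozilla/5.0 (Windows NT 6.1)', np.nan,
                    'GoogleMaps/RochesterNY'])

# Drop the NaN, then keep the first whitespace-separated token of each string.
first_tokens = pd.Series([x.split()[0] for x in agents.dropna()])

print(first_tokens.tolist())
```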

The notnull() method returns, for each value, whether it is not a missing value (True where a value is present)
```
      In [133]: frame.a.notnull()
      Out[133]:
      0      True
      1      True
      2      True
      3      True
```
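- In [136]: cframe = frame[frame.a.notnull()]
  - Extracts the rows of the data frame whose ['a'] column is not a missing value
str.contains() is called on a Series object and tests whether each element contains the string given as the argument
- It is case-sensitive, so take care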

http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern

  In [159]: cframe['a'].str.contains('Windows'),'Windows','notWindows'
  Out[159]:
  (0      True
  1     False
  2      True
  3     False
  4      True
  5      True
  6      True
  7      True
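
To check the case-sensitivity point, here is a minimal sketch on a made-up Series (`agents` is hypothetical). It filters with notnull() first, then compares the default match with `case=False`, which I believe is the parameter for a case-insensitive match.

```python
import pandas as pd

# Hypothetical user-agent strings: one missing, one written in lowercase.
agents = pd.Series(['Mozilla/5.0 (Windows NT 6.1)', 'GoogleMaps/RochesterNY',
                    None, 'mozilla (windows)'])

# Keep only the non-missing values, as with cframe above.
present = agents[agents.notnull()]

is_windows = present.str.contains('Windows')                  # case-sensitive
is_windows_ci = present.str.contains('windows', case=False)   # ignores case

print(is_windows.tolist())
print(is_windows_ci.tolist())
```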

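Filtering with numpy's where

After import numpy as np, call the function as np.where()
- With one argument, it returns the indices of the elements that satisfy the condition
- With three arguments, it returns an array filled with one value where the condition holds and another where it does not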
- http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
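
Both call forms side by side, as a minimal sketch on a made-up boolean array (`flags` is hypothetical; the two labels mirror the strings typed in the session above).

```python
import numpy as np

# Hypothetical condition array, e.g. the result of a str.contains() test.
flags = np.array([True, False, True, True])

# One argument: a tuple of index arrays for the positions where flags is True.
idx = np.where(flags)

# Three arguments: pick 'Windows' where True and 'notWindows' where False.
labels = np.where(flags, 'Windows', 'notWindows')

print(idx[0].tolist())
print(labels.tolist())
```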
