
Using the category Data Type in pandas

Date: 2022-06-15 16:13:42

Creating a category

Creating from a Series

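Passing dtype='category' when creating a Series produces categorical data. A category has two parts: the set of categories (the distinct literal values) and a flag that says whether those categories are ordered. The sessions below assume that pandas and NumPy have been imported as pd and np: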

In [1]: s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
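As a rough sketch of those two parts, the categories and the per-row integer codes that point into them can be inspected through the .cat accessor. The variable names below are illustrative only.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# The distinct values, stored once.
categories = list(s.cat.categories)

# One small integer per row, indexing into the categories.
codes = list(s.cat.codes)

# Whether the categories carry an order (unordered by default).
is_ordered = s.cat.ordered

print(categories, codes, is_ordered)
```

Each row stores only a code, and the actual values live once in the categories index.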

A Series inside a DataFrame can be converted to a category:

In [3]: df = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})

In [4]: df['B'] = df['A'].astype('category')

In [5]: df['B']
Out[5]:
0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to Series. Values that are not listed in categories become NaN:

In [10]: raw_cat = pd.Categorical(
   ....:     ['a', 'b', 'c', 'a'], categories=['b', 'c', 'd'], ordered=False
   ....: )
   ....:

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']

Creating from a DataFrame

When creating a DataFrame, you can also pass dtype='category':

In [17]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype='category')

In [18]: df.dtypes
Out[18]:
A    category
B    category
dtype: object

Columns A and B of the DataFrame are both categories, and each column infers its own categories from its own data:

In [19]: df['A']
Out[19]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df['B']
Out[20]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']
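The per-column inference can be checked programmatically. The following is a minimal sketch that compares the two columns; the names cats_a, cats_b, and same_dtype are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype='category')

# Categories are inferred separately for each column.
cats_a = list(df['A'].cat.categories)
cats_b = list(df['B'].cat.categories)

# Because the category sets differ, the two column dtypes are not equal.
same_dtype = df['A'].dtype == df['B'].dtype

print(cats_a, cats_b, same_dtype)
```

This matters when two columns are expected to share one set of categories, since that requires an explicit dtype rather than dtype='category'.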

Alternatively, df.astype('category') converts every column of the DataFrame to a category:

In [21]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

In [22]: df_cat = df.astype('category')

In [23]: df_cat.dtypes
Out[23]:
A    category
B    category
dtype: object

Controlling creation

By default, a category created with dtype='category' behaves as follows:

1. The categories are inferred from the data.

2. The categories are unordered.

You can create a CategoricalDtype explicitly to override both defaults:

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(['a', 'b', 'c', 'a'])

In [28]: cat_type = CategoricalDtype(categories=['b', 'c', 'd'], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']
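The NaN values in the output come from 'a' not being one of the declared categories. A small sketch, using the same data, shows how to detect those rows and how they are encoded internally; the names missing, categories, and codes are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=['b', 'c', 'd'], ordered=True)
s_cat = pd.Series(['a', 'b', 'c', 'a']).astype(cat_type)

# 'a' is not among the declared categories, so rows 0 and 3 become NaN.
missing = s_cat.isna().tolist()

# The unused category 'd' is still part of the dtype.
categories = list(s_cat.cat.categories)

# Missing values are encoded with the code -1.
codes = s_cat.cat.codes.tolist()

print(missing, categories, codes)
```

Checking isna() after such a conversion is a cheap way to catch values that silently fell outside the declared categories.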

The same CategoricalDtype can also be applied to a DataFrame, in which case every column shares the same categories:

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

In [33]: cat_type = CategoricalDtype(categories=list('abcd'), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat['A']
Out[35]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat['B']
Out[36]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

Converting back to the original type

Series.astype(original_dtype) or np.asarray(categorical) converts a category back to its original type:

In [39]: s = pd.Series(['a', 'b', 'c', 'a'])

In [40]: s
Out[40]:
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype('category')

In [42]: s2
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]:
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)

Working with categories

Getting the attributes of a category

Categorical data has two attributes, categories and ordered, which are available as s.cat.categories and s.cat.ordered:

In [57]: s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False
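The same two attributes are also carried by the dtype of the Series. The sketch below reads them from s.dtype and checks that they agree with the .cat accessor; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# The dtype of a categorical Series is a CategoricalDtype instance.
dtype = s.dtype
is_categorical = isinstance(dtype, pd.CategoricalDtype)

# It exposes the same categories and ordered flag as the .cat accessor.
same_categories = dtype.categories.equals(s.cat.categories)
same_ordered = dtype.ordered == s.cat.ordered

print(is_categorical, same_categories, same_ordered)
```

Reading from the dtype is handy when the dtype is going to be reused for another Series via astype.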

The order of the categories can be chosen when the Categorical is created:

In [60]: s = pd.Series(pd.Categorical(['a', 'b', 'c', 'a'], categories=['c', 'b', 'a']))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

Renaming categories

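In older pandas versions, categories can be renamed by assigning to s.cat.categories. Newer releases (pandas 2.0 and later) no longer allow this assignment, so rename_categories is the portable choice: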

In [67]: s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

In [68]: s
Out[68]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ['Group %s' % g for g in s.cat.categories]

In [70]: s
Out[70]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']
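Direct assignment to s.cat.categories is not available in every pandas version (it appears to have been removed in pandas 2.0, which is worth checking against the release notes for your installation). The same renaming can be sketched without assignment; the name renamed is illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# rename_categories returns a new Series and leaves the original untouched.
new_names = ['Group %s' % g for g in s.cat.categories]
renamed = s.cat.rename_categories(new_names)

print(renamed.tolist())
print(list(s.cat.categories))
```

Since nothing is modified in place, the result has to be assigned back to a variable to take effect.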

rename_categories can likewise replace the category names, here with integers:

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Or pass a dict-like object that maps old names to new ones:

# You can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: 'x', 2: 'y', 3: 'z'})

In [74]: s
Out[74]:
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Adding categories with add_categories

add_categories appends new categories:

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]:
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]

Removing categories with remove_categories

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]:
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Removing unused categories

In [82]: s = pd.Series(pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c', 'd']))

In [83]: s
Out[83]:
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]:
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

Resetting categories

set_categories() adds and removes categories in a single step. Values whose category is dropped become NaN:

In [85]: s = pd.Series(['one', 'two', 'four', '-'], dtype='category')

In [86]: s
Out[86]:
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(['one', 'two', 'three', 'four'])

In [88]: s
Out[88]:
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

Sorting categories

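If a category is created with ordered=True, the order of its categories is meaningful: values sort in category order, and min() and max() are available (on unordered data they raise a TypeError):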

In [91]: s = pd.Series(['a', 'b', 'c', 'a']).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')
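In the session above the category order happens to match alphabetical order, which hides the effect. The following sketch uses hypothetical size labels (not from the original article) whose logical order differs from their lexical order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical labels: the declared order is small < medium < large.
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
s = pd.Series(['large', 'small', 'medium', 'small']).astype(size_type)

# Sorting follows the declared category order, not alphabetical order.
sorted_values = s.sort_values().tolist()

# min() and max() also use the category order.
smallest, largest = s.min(), s.max()

print(sorted_values, smallest, largest)
```

With plain strings the same data would sort as large, medium, small, so the ordered category is what carries the intended meaning.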

as_ordered() and as_unordered() switch a category to ordered or unordered:

In [95]: s.cat.as_ordered()
Out[95]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Reordering

Categorical.reorder_categories() changes the order of the existing categories:

In [103]: s = pd.Series([1, 2, 3, 1], dtype='category')

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

Sorting by multiple columns

sort_values can sort by several columns at once, and a categorical column is sorted in its category order:

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         'A': pd.Categorical(
   .....:             list('bbeebbaa'),
   .....:             categories=['e', 'a', 'b'],
   .....:             ordered=True,
   .....:         ),
   .....:         'B': [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....:

In [110]: dfs.sort_values(by=['A', 'B'])
Out[110]:
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Comparison operations

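If ordered=True was set at creation, categorical values can be compared using ==, !=, >, >=, <, and <=. The equality operators == and != also work on unordered data, while the ordering operators require ordered=True, and two ordered categoricals can only be compared with each other when they have the same categories in the same order.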
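These rules can be sketched as follows. The variable names are illustrative, and the exact error messages are not shown because they may differ between pandas versions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categories with the declared order 3 < 2 < 1.
ordered_type = CategoricalDtype([3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(ordered_type)

# Comparing with a scalar uses the category order, so only 1 is "greater" than 2.
greater = (cat > 2).tolist()

# A categorical with different categories (here inferred as just [2]).
other = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
try:
    cat > other
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

# Unordered data supports equality checks but not ordering comparisons.
unordered = pd.Series(['a', 'b', 'a'], dtype='category')
equal_a = (unordered == 'a').tolist()
try:
    unordered > 'a'
    unordered_error = None
except TypeError as exc:
    unordered_error = exc

print(greater, equal_a)
print(type(mismatch_error).__name__, type(unordered_error).__name__)
```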

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [119]: cat > cat_base
Out[119]:
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]:
0     True
1    False
2    False
dtype: bool

Other operations

Categorical data still lives in a Series, so most Series operations work on it, for example Series.min(), Series.max(), and Series.mode().

value_counts (categories that never occur are reported with a count of 0):

In [131]: s = pd.Series(pd.Categorical(['a', 'b', 'c', 'c'], categories=['c', 'a', 'b', 'd']))

In [132]: s.value_counts()
Out[132]:
c    2
a    1
b    1
d    0
dtype: int64

DataFrame.sum() (the level argument used here exists only in older pandas versions and was removed in pandas 2.0):

In [133]: columns = pd.Categorical(
   .....:     ['One', 'One', 'Two'], categories=['One', 'Two', 'Three'], ordered=True
   .....: )
   .....:

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([['A', 'B', 'B'], columns]),
   .....: )
   .....:

In [135]: df.sum(axis=1, level=1)
Out[135]:
   One  Two  Three
0    3    3      0
1    9    6      0
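The level argument of DataFrame.sum is not accepted by recent pandas releases (it appears to have been removed in pandas 2.0; verify against your version). A sketch of an equivalent that avoids it groups the transposed frame by the second column level; the name result is illustrative.

```python
import pandas as pd

columns = pd.Categorical(
    ['One', 'One', 'Two'], categories=['One', 'Two', 'Three'], ordered=True
)
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_arrays([['A', 'B', 'B'], columns]),
)

# Transpose so the column labels become the row index, group by the second
# level, sum each group, then transpose back. observed=False keeps the
# unused category 'Three' in the result.
result = df.T.groupby(level=1, observed=False).sum().T

print(result)
```

The unused category Three still appears as a column of zeros, matching the behavior shown in the session above.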

Groupby (unused categories appear as groups with NaN results):

In [136]: cats = pd.Categorical(
   .....:     ['a', 'b', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c', 'd']
   .....: )
   .....:

In [137]: df = pd.DataFrame({'cats': cats, 'values': [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby('cats').mean()
Out[138]:
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(['a', 'a', 'b', 'b'], categories=['a', 'b', 'c'])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         'cats': cats2,
   .....:         'B': ['c', 'd', 'c', 'd'],
   .....:         'values': [1, 2, 3, 4],
   .....:     }
   .....: )
   .....:

In [141]: df2.groupby(['cats', 'B']).mean()
Out[141]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
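Whether unused categories show up as groups is controlled by the observed argument of groupby. Its default has been changing between pandas releases, so the sketch below passes it explicitly in both directions; the names all_groups and seen_groups are illustrative.

```python
import pandas as pd

cats = pd.Categorical(
    ['a', 'b', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c', 'd']
)
df = pd.DataFrame({'cats': cats, 'values': [1, 2, 2, 2, 3, 4, 5]})

# observed=False keeps every declared category, including the unused 'd'.
all_groups = df.groupby('cats', observed=False)['values'].mean()

# observed=True keeps only the categories that actually occur in the data.
seen_groups = df.groupby('cats', observed=True)['values'].mean()

print(all_groups)
print(seen_groups)
```

Passing observed explicitly keeps the result stable across versions and avoids relying on whichever default is installed.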

Pivot tables:

In [142]: raw_cat = pd.Categorical(['a', 'a', 'b', 'b'], categories=['a', 'b', 'c'])

In [143]: df = pd.DataFrame({'A': raw_cat, 'B': ['c', 'd', 'c', 'd'], 'values': [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values='values', index=['A', 'B'])
Out[144]:
     values
A B
a c       1
  d       2
b c       3
  d       4
