Using a dict object to count the words in a text


In [1]:
# a sample sentence containing many words
sentence = "The ability to analyze data with Python is critical in data science Learn the basics and move on to create stunning visualizations"
# start with an empty dict
word_dict = {}

# split the sentence into words and count how often each word occurs
for word in sentence.split():
    if word not in word_dict:
        word_dict[word] = 1
    else:
        word_dict[word] += 1
        
# print the result
print(word_dict)


{'and': 1, 'on': 1, 'basics': 1, 'ability': 1, 'move': 1, 'Python': 1, 'data': 2, 'is': 1, 'Learn': 1, 'stunning': 1, 'to': 2, 'create': 1, 'critical': 1, 'in': 1, 'visualizations': 1, 'The': 1, 'analyze': 1, 'with': 1, 'the': 1, 'science': 1}
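
Note that the output above counts 'The' and 'the' as two different words, because dict keys are case-sensitive. If that distinction is not wanted, one option is to lowercase each word before counting. The following is a minimal sketch written for Python 3; the name lower_dict is introduced here for illustration only and does not appear in the original notebook.

```python
sentence = "The ability to analyze data with Python is critical in data science Learn the basics and move on to create stunning visualizations"

# lower_dict is a hypothetical name used only in this sketch
lower_dict = {}

for word in sentence.split():
    key = word.lower()  # normalize case so 'The' and 'the' share one key
    # dict.get returns the current count, or 0 if the key is not present yet
    lower_dict[key] = lower_dict.get(key, 0) + 1

print(lower_dict['the'])
print(lower_dict['data'])
```

Using dict.get with a default of 0 also replaces the if/else branch from the cell above with a single line.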

Alternatively, use the defaultdict class, which has to be imported from the collections module


In [1]:
from collections import defaultdict

sentence = "The ability to analyze data with Python is critical in data science Learn the basics and move on to create stunning visualizations"

word_dict = defaultdict(int)

for word in sentence.split():
    word_dict[word] += 1
    
print(word_dict)


defaultdict(<type 'int'>, {'and': 1, 'on': 1, 'basics': 1, 'ability': 1, 'move': 1, 'Python': 1, 'data': 2, 'is': 1, 'Learn': 1, 'stunning': 1, 'to': 2, 'create': 1, 'critical': 1, 'in': 1, 'visualizations': 1, 'The': 1, 'analyze': 1, 'with': 1, 'the': 1, 'science': 1})
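
The loop above needs no "if word not in word_dict" check because defaultdict calls its factory (here int, which returns 0) whenever a missing key is read. The short sketch below, written for Python 3, illustrates that behaviour; the names counts and first_read are made up for this example.

```python
from collections import defaultdict

# counts and first_read are hypothetical names used only in this sketch
counts = defaultdict(int)  # int() returns 0, so missing keys start at 0

# Reading a missing key calls the factory, stores the result and returns it
first_read = counts['python']
print(first_read)
print('python' in counts)

# A membership test does not create a key
print('java' in counts)
print(len(counts))
```

One consequence worth keeping in mind: merely reading a key inserts it, so use "in" rather than indexing when the goal is only to check whether a word has been seen.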

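To iterate over a dict, use its items() method to get the key-value pairs; keys() iterates over all the keys and values() iterates over all the values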


In [ ]:
for key, value in word_dict.items():
    print("%s %s" % (key, value))
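
The cell above only demonstrates items(). As a complement, here is a small Python 3 sketch of keys() and values(); sample_counts is a hypothetical three-entry dict used in place of the full word_dict so that the example stays self-contained.

```python
# sample_counts is a hypothetical, shortened stand-in for word_dict
sample_counts = {'data': 2, 'to': 2, 'Python': 1}

# keys() yields every key; sorted() gives a stable, readable order
for key in sorted(sample_counts.keys()):
    print(key)

# values() yields every value, so summing them gives the total word count
total = sum(sample_counts.values())
print(total)
```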

Counter is a dict subclass for counting objects: each element is stored as a key and its count as the value

from collections import Counter

sentence = "The ability to analyze data with Python is critical in data science Learn the basics and move on to create stunning visualizations"

words = sentence.split()

word_dict = Counter(words)

print(word_dict)
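
Beyond building the counts in one call, Counter offers a few conveniences that a plain dict lacks. The Python 3 sketch below shows two of them, most_common() and the zero default for missing elements; the names word_counts and top_two are chosen for this example only.

```python
from collections import Counter

sentence = "The ability to analyze data with Python is critical in data science Learn the basics and move on to create stunning visualizations"

# word_counts and top_two are hypothetical names used only in this sketch
word_counts = Counter(sentence.split())

# most_common(n) returns the n most frequent elements as (element, count) pairs
top_two = word_counts.most_common(2)
print(top_two)

# A Counter returns 0 for a missing element instead of raising KeyError,
# and unlike defaultdict it does not insert the key when it is read
print(word_counts['missing'])
print('missing' in word_counts)
```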

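A dict of dicts. As an example, suppose three users have each rated five movies: how can these ratings be stored in a dict? The answer is to nest one dict inside another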


In [5]:
from collections import defaultdict

user_movie_rating = defaultdict(lambda: defaultdict(int))

# initialize the movie ratings for one user

user_movie_rating['Roc-J']['wolf1'] = 4
user_movie_rating['Roc-J']['wolf2'] = 5
user_movie_rating['Roc-J']['wolf3'] = 3
user_movie_rating['Roc-J']['icecream'] = 3
user_movie_rating['Roc-J']['SW'] = 5

print(user_movie_rating)


defaultdict(<function <lambda> at 0x04B68D70>, {'Roc-J': defaultdict(<type 'int'>, {'icecream': 3, 'SW': 5, 'wolf1': 4, 'wolf3': 3, 'wolf2': 5})})
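
The cell above stores ratings for a single user. To show how the nested structure is read back once several users are present, here is a rough Python 3 sketch. The user 'user_b' and its ratings are invented purely for illustration, and totals, rater_counts and average are hypothetical helper names.

```python
from collections import defaultdict

user_movie_rating = defaultdict(lambda: defaultdict(int))

# 'Roc-J' and these two ratings come from the example above;
# 'user_b' and its ratings are hypothetical, added only for illustration
user_movie_rating['Roc-J']['wolf1'] = 4
user_movie_rating['Roc-J']['SW'] = 5
user_movie_rating['user_b']['wolf1'] = 2
user_movie_rating['user_b']['SW'] = 3

# Walk the outer dict (users), then the inner dict (movies)
for user, ratings in user_movie_rating.items():
    for movie, rating in ratings.items():
        print(user, movie, rating)

# Average rating per movie across all users who rated it
totals = defaultdict(int)
rater_counts = defaultdict(int)
for ratings in user_movie_rating.values():
    for movie, rating in ratings.items():
        totals[movie] += rating
        rater_counts[movie] += 1

average = {movie: totals[movie] / rater_counts[movie] for movie in totals}
print(average)
```

Because the inner factory is int, reading a rating for a movie the user never rated returns 0 and silently adds that entry, so it may be preferable to test with "in" before reading when 0 should not be mistaken for a real rating.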