Using NLTK with the Twitter API to Collect Social Media Data: Trump's Twitter Data as an Example

22 min read · Oct 14, 2018

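This post uses NLTK to collect Twitter data and has three parts: setting up the environment, getting data from the Twitter API (structured and unstructured), and a hands-on example that retrieves Trump's tweets. For more detail, see the official English documentation, Twitter HOWTO.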

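What is NLTK? Haven't installed it yet?

NLTK stands for Natural Language Toolkit, a Python-based toolbox for natural language processing. An earlier series introduced NLTK; the installation instructions are in NLTK Beginner's Guide (1): An Easy-to-Use Natural Language Toolkit (Exploration).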

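Environment setup

1. Create a Twitter application

Because the data comes from Twitter through its API, the first step is to create an application of your own. Go to https://apps.twitter.com and log in with your Twitter account and password; the page for creating a new application is available from there.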

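2. Save the API keys

This step configures the client side of the Twitter API requests. Create a folder named twitter-files (any name works) and save the API key information in a file called credentials.txt with the following contents: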

app_key = YOUR CONSUMER KEY  
app_secret = YOUR CONSUMER SECRET  
oauth_token = YOUR ACCESS TOKEN  
oauth_token_secret = YOUR ACCESS TOKEN SECRET

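The key values are shown on your application's page once you open it. Enter them directly to the right of the equals sign; no quotation marks are needed.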
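As a quick sanity check before calling the API, the file layout above can be parsed with a few lines of plain Python. This is only a sketch of the name = value format: `parse_credentials` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration, not NLTK's own loader, and the key values are placeholders.

```python
REQUIRED_KEYS = ("app_key", "app_secret", "oauth_token", "oauth_token_secret")

def parse_credentials(text):
    """Parse 'name = value' lines into a dict and check the four expected keys."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split("=", 1)
        creds[name.strip()] = value.strip()
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return creds

# Placeholder values, written without quotes as in credentials.txt
sample = """
app_key = AAA
app_secret = BBB
oauth_token = CCC
oauth_token_secret = DDD
"""
creds = parse_credentials(sample)
print(sorted(creds))
```

If a key is missing or left empty, the function raises a ValueError naming it, which is usually easier to read than an authentication failure later on.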

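3. Set the environment variable

I use zsh, so as an example: run open ~/.zshrc in the terminal to edit the zsh configuration file, then paste in the line below. If you use bash, run open ~/.bashrc instead.

export TWITTER='/path/to/your/twitter-files'  # path to the twitter-files folder created above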

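If you see the following error when running the code:

Supply a value to the 'subdir' parameter or set the TWITTER environment variable.

it means Jupyter Notebook has not yet picked up the updated shell configuration. Restart Jupyter Notebook and check with %env; if the following entry appears, the setup worked:

'TWITTER': '/Users/youngmihuang/twitter-files'  # location of twitter-files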
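The same check can be done from Python instead of %env. The sketch below assumes the layout described above (a TWITTER variable pointing at a folder that contains credentials.txt); `credentials_path` is a hypothetical helper name, not part of NLTK.

```python
import os

def credentials_path(environ=os.environ, filename="credentials.txt"):
    """Return the expected credentials path, or None if TWITTER is not set."""
    folder = environ.get("TWITTER")
    if not folder:
        return None
    return os.path.join(folder, filename)

path = credentials_path()
if path is None:
    print("TWITTER is not set; restart the notebook after editing your shell profile")
elif not os.path.isfile(path):
    print("TWITTER is set, but no file was found at", path)
else:
    print("Found credentials at", path)
```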

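4. Install twython

twython is a third-party library that wraps the Twitter API for Python. The NLTK twitter package introduced below depends on it: nltk.twitter calls twython, and twython calls the Twitter API.

Install it with pip install twython or easy_install twython, and you are ready to call the Twitter API through NLTK.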

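How to get Twitter data

The stream obtained through nltk.twitter covers only 1% of all public tweets, and both language and content are random, so every run returns different data.

1. Search for tweets containing specific words

tw.tweets(keywords='word1, word2, word3, …')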

# Search for tweets containing any of several words (a comma means 'or')
from nltk.twitter import Twitter
tw = Twitter()
tw.tweets(keywords='love, hate', limit=10)  # fetch only 10 tweets

# Result (each tweet starts with RT)
RT @CollinRugg: People are speculating that Elizabeth Warren might take on Trump in 2020.
RT @Harry_Styles: Wow, Eight years has passed. Thank you for all the love, thank you for all the support. Thank you for everything. 
I love…
RT @lilbaked: i love drinking alcohol i didn’t pay for
RT @louisvlifestyle: Hate when someone doesn’t keep their word
RT @_18RIMA: this is the cutest vid i’ve ever seen oh my fucking good i love troye so fucking much https://t.co/DuKLahuzB9
RT @omgDebbie: The McCanns killed Maddie https://t.co/z6tzndJvfx
RT @Srija_ma: who will watch I will watch my whole family loved promo Eri I was only one fan now my mom also became her fans she said the g…
RT @bangstanmutuals: rt if you love @BTS_twt 

- jungkook
- jimin
- taehyung
- seokjin
- namjoon
- yoongi
- hoseok

follow whoever rts 💋
RT @DFBHarvard: OK!

It's Time for me to tell @FoxNews how much I hate Shepard Smith. 

Especially at $8 million/year.

Anyone care to join…
RT @atiqah98: If you love someone, just let it be. If she/he comes back, it's yours. If it doesn't come back, it was never meant to be. Mov…
RT @Harry_Styles: Wow, Eight years has passed. Thank you for all the love, thank you for all the support. Thank you for everything. 
I love…
Written 10 Tweets
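To make the comma rule concrete, here is a rough local approximation of how a keyword string such as 'love, hate' selects tweets: each comma-separated phrase is an alternative, and all words inside one phrase must appear. `matches_keywords` is a hypothetical helper written for this illustration; the real matching happens on Twitter's servers and considers more than the plain tweet text.

```python
import re

def matches_keywords(text, keywords):
    """Return True if any comma-separated phrase has all of its words in text."""
    words = set(re.findall(r"\w+", text.lower()))
    for phrase in keywords.split(","):
        terms = phrase.lower().split()
        if terms and all(term in words for term in terms):
            return True
    return False

print(matches_keywords("i love drinking coffee", "love, hate"))       # True
print(matches_keywords("Hate when plans change", "love, hate"))       # True
print(matches_keywords("Liquid water exists on Mars", "love, hate"))  # False
```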

2. Search for tweets from a specific account

tw.tweets( follow=['TwitterID'] )

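First you need the TwitterID, which the API uses as the parameter for retrieving data. The TwitterID is derived from the username, and the Twitter ID and username converter can look it up. After searching for the account you are interested in on the Twitter home page, its username can be found in two places on the page.

TwitterIDs converted from usernames: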

royalfamily => 36042554      # The British Royal Family
antogriezmann => 950341134   # Antoine Griezmann, French footballer (jersey number 7)
icmlconf => 387156826        # ICML (a top international machine learning conference)
cnn => 759251                # CNN

Source: Twitter ID and username converter

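Because the stream from nltk.twitter covers only 1% of all public tweets, accounts with many posts, replies, and retweets generate more data, so the API returns their data faster. Testing confirmed this: data for CNN arrived much faster than for the other accounts.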

# Search using CNN's TwitterID: 759251
tw = Twitter()
tw.tweets(follow=['759251'], limit=10)  # see what CNN is talking about

# Result
@CNN 2020 race is already over. Trump won.
@CNN Really and how many votes will they get for that gift lol
RT @CNN: The White House banned Kaitlan Collins, a White House reporter for CNN, from a press event after Collins asked President Trump que…
RT @CNN: Misty Copeland, Gigi Hadid to star in 2019 Pirelli calendar https://t.co/wuFzWN3MR7 via @CNNStyle https://t.co/ou3rWP7LD7
RT @CNN: A number of potential 2020 Democratic presidential contenders are behind a bill that would allow Puerto Rico to terminate its $73…
@CNN Hack #KaitlinCollins kicked out! https://t.co/sOJRALR5CC
RT @CNN: JPMorgan Chase, Wells Fargo and other banks have come under fire for helping to finance private prisons used by the federal govern…
RT @CNN: A number of potential 2020 Democratic presidential contenders are behind a bill that would allow Puerto Rico to terminate its $73…
@CNN At they going to terminate every US citizens debt too?? They never learn
RT @CNN: After years of research, scientists have confirmed that liquid water exists on Mars. This could allow humans to further explore th…
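The output above mixes two kinds of tweets: retweets of CNN's own posts (starting with "RT @CNN:") and replies addressed to CNN (starting with "@CNN"). A rough way to separate them is to look at the text prefix, as sketched below. `classify_tweet` is a hypothetical helper and the prefix test is only a heuristic; with structured data, fields such as retweeted_status and in_reply_to_screen_name are a more reliable signal.

```python
def classify_tweet(text, screen_name="CNN"):
    """Label a tweet text as 'retweet', 'reply', or 'other' by its prefix."""
    lowered = text.lower()
    handle = "@" + screen_name.lower()
    if lowered.startswith("rt " + handle + ":"):
        return "retweet"
    if lowered.startswith(handle + " "):
        return "reply"
    return "other"

samples = [
    "RT @CNN: Misty Copeland, Gigi Hadid to star in 2019 Pirelli calendar",
    "@CNN Really and how many votes will they get for that gift lol",
    "Liquid water exists on Mars",
]
labels = [classify_tweet(s) for s in samples]
print(labels)  # ['retweet', 'reply', 'other']
```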

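Using the API to get Twitter data

Next, the credentials.txt saved during environment setup is used to call the API. The two main classes are Streamer (backed by the Streaming API) and Query (backed by the Search API, for historical data).

On data quality: Twitter offers three API tiers for data access: standard, premium, and enterprise. With standard, the Streaming API returns only 1% of the data, and the Search API covers the past 7 days and is not complete, so it suits lightweight analysis. If standard works well for you and you need data at a larger scale, look at the premium and enterprise tiers.

Learn more: Twitter API developer documentation; differences between the Streaming API and the Search API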

Streamer

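Arguments to register(): TweetViewer prints tweets to the screen and TweetWriter writes them to a file; limit sets the number of tweets to collect. Data is fetched with either sample() or filter(). In my tests, Streamer picked up news from within the last 30 minutes.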

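from nltk.twitter import Query, Streamer, Twitter, TweetViewer, TweetWriter, credsfromfile

oauth = credsfromfile()                   # looks for credentials.txt by default

# Streamer
client = Streamer(**oauth)                # near-real-time data
client.register(TweetViewer(limit=3))     # choose a handler: view on screen or write to a file
# Use one of the two calls below; the result shown comes from filter()
# client.sample()                         # fetch data: random sample
client.filter(track='machine learning')   # fetch data: filter by specific words

# Result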
RT @AWS_HongKong: Cool demo time! Olivier Klein, Head of Emerging Technologies, APAC, AWS Solutions Architecture is now going to show us so…
RT @walmyrcarvalho: if (isMouseMoving) stopUpdates()

MACHINE LEARNING https://t.co/rp4cZxtZFY
RT @rautsan: Top story: How one medical group uses AI, machine learning to improve value-based care | Healthcare IT News https://t.co/brymC…
Written 3 Tweets

Query

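Next is Query. Twitter returns data as JSON, which is then converted into a Python dictionary. An expert in the community has compiled a visual map of this structure (learn more: API spec documentation).

search_tweets() is a generator; its arguments include the search terms (e.g. keywords='machine learning'). next(tweets) takes the first tweet from the generator: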

# Query
from pprint import pprint

client = Query(**oauth)                                               # historical data
tweets = client.search_tweets(keywords='machine learning', limit=10)
tweet = next(tweets)                                                  # take the first tweet
pprint(tweet, depth=1)

# Result (the first tweet)
{'contributors': None,
 'coordinates': None,
 'created_at': 'Thu Jul 26 05:07:59 +0000 2018',
 'entities': {...},
 'favorite_count': 0,
 'favorited': False,
 'geo': None,
 'id': 1022347777823657984,
 'id_str': '1022347777823657984',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'metadata': {...},
 'place': None,
 'possibly_sensitive': False,
 'retweet_count': 7,
 'retweeted': False,
 'retweeted_status': {...},
 'source': '<a href="http://twitter.com/download/iphone" '
           'rel="nofollow">Twitter for iPhone</a>',
 'text': 'RT @deryck_jeremy: Enjoyed talking #AI in #Ag with @TimHammerich and '
         'sharing advice for budding #agtech entrepreneurs. AI is transforming '
         't…',
 'truncated': False,
 'user': {...}}
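Since the Search API only reaches back about 7 days, it is often useful to turn created_at into a real datetime and check how old a tweet is. The sketch below parses the timestamp format shown in the result above using only the standard library. `parse_created_at` and `within_days` are hypothetical helper names, the tweet dict is trimmed to a few fields, and the "now" value is fixed for the sake of a reproducible example.

```python
from datetime import datetime, timedelta, timezone

# Format of the created_at field, e.g. 'Thu Jul 26 05:07:59 +0000 2018'
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(value):
    """Convert a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)

def within_days(created_at, now, days=7):
    """Return True if created_at is at most `days` days before `now`."""
    return timedelta(0) <= now - created_at <= timedelta(days=days)

tweet = {"created_at": "Thu Jul 26 05:07:59 +0000 2018", "retweet_count": 7, "lang": "en"}
created = parse_created_at(tweet["created_at"])
now = datetime(2018, 7, 28, tzinfo=timezone.utc)  # fixed reference time for the example

print(created.isoformat())        # 2018-07-26T05:07:59+00:00
print(within_days(created, now))  # True
```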

From this structured data you can pick whichever fields you need. For example, to get the follower and friend counts of Twitter accounts, read screen_name, followers_count, and friends_count from the user object:

# Number of followers and number of accounts being followed
# The last ID, 25073877, is realDonaldTrump, which appears in the result below
userids = ['36042554', '950341134', '387156826', '759251', '612473', '25073877']
client = Query(**oauth)
user_info = client.user_info_from_id(userids)
for info in user_info:
    name = info['screen_name']
    followers = info['followers_count']
    following = info['friends_count']
    print('{}, followers: {}, following: {}'.format(name, followers, following))

# Result
RoyalFamily, followers: 3772976, following: 522 
AntoGriezmann, followers: 5480922, following: 3
icmlconf, followers: 11054, following: 5
CNN, followers: 40001853, following: 1115
BBCNews, followers: 9412163, following: 99
realDonaldTrump, followers: 53285875, following: 47
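Before plotting, it helps to collect the printed values into records. The sketch below does this with plain Python, using the numbers from the result above, and sorts the accounts by follower count. The field names in `records` are my own choice for illustration; a list of dicts like this can be passed directly to pandas.DataFrame(records) if pandas is available.

```python
# (screen_name, followers, following) taken from the result above
rows = [
    ("RoyalFamily", 3772976, 522),
    ("AntoGriezmann", 5480922, 3),
    ("icmlconf", 11054, 5),
    ("CNN", 40001853, 1115),
    ("BBCNews", 9412163, 99),
    ("realDonaldTrump", 53285875, 47),
]

records = [
    {"screen_name": name, "followers": followers, "following": following}
    for name, followers, following in rows
]

# Sort from most to fewest followers
by_followers = sorted(records, key=lambda r: r["followers"], reverse=True)
for r in by_followers:
    print("{:<16} {:>12,}".format(r["screen_name"], r["followers"]))
```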

After converting the result above into a pandas DataFrame, it can be visualized with seaborn:

Code reference:

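Mini project: getting Trump's tweets

US President Donald J. Trump posts all kinds of tweets, sometimes even ahead of the official press release. Both his own tweets and the retweets and replies that mention him are good material for sentiment analysis. Because Streamer samples only 1% of real-time data, a keyword search for tweets mentioning Trump returns an unstable number of tweets, so the example below uses Query to fetch the tweets Trump himself posted over the past seven days: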

Writing

The API writes the data as a .json file and reports where the file was saved:

# Query
client = Query(**oauth)                    # historical data
client.register(TweetWriter())             # write to a file
client.user_tweets('realDonaldTrump', 10)  # fetch 10 of Trump's tweets

# Result (after running the Query, the output tells you where the data was saved)

The tweets are written in order of time, newest first, as one JSON object per tweet with a line break between tweets, like this:

{"created_at": "Thu..", "id": .., "id_str": "..", "text": "..", ..}
{"created_at": "Wed..", "id": .., "id_str": "..", "text": "..", ..}
{"created_at": "Wed..", "id": .., "id_str": "..", "text": "..", ..}
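The one-object-per-line layout means each line can be parsed on its own, without loading the whole file as a single JSON document. The sketch below shows the writing side of that layout with the standard json module, using an in-memory buffer as a stand-in for the file and two made-up tweets; it is not how TweetWriter itself is implemented.

```python
import io
import json

# Made-up tweets standing in for real API results
tweets = [
    {"id": 2, "text": "newer tweet"},
    {"id": 1, "text": "older tweet"},
]

buffer = io.StringIO()  # stand-in for the .json file on disk
for tweet in tweets:
    buffer.write(json.dumps(tweet) + "\n")  # one JSON object per line

lines = buffer.getvalue().splitlines()
first = json.loads(lines[0])  # any single line is valid JSON by itself
print(len(lines), first["text"])  # 2 newer tweet
```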

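Reading

There are two ways. The goal is to read the value of the text field in the .json file, which holds the tweet content (that is, what Trump posted). To highlight how concise NLTK is, the version without NLTK comes first:

(1) Without NLTK support

Read the .json file, convert each line to a dict, and pull out the value of the text field for further processing in Python: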

# Read Trump's tweets (text field)
import json

tw_list = []
filename = '/Users/youngmihuang/twitter-files/tweets.20180726-155316.json'
with open(filename, 'r', encoding='utf-8') as f:
    for line in f:               # one JSON object per line
        twe = json.loads(line)
        each = twe['text']
        tw_list.append(each)
        print(each)

# Result (first 3 only)
European Union representatives told me that they would start buying soybeans from our great farmers immediately. Al… https://t.co/Yuqt4KNeDz
Great to be back on track with the European Union. This was a big day for free and fair trade!
Thank you Georgia! They say that my endorsement last week of Brian Kemp, in the Republican Primary for Governor aga… https://t.co/M1mCXcWQup

(2) With NLTK support (concise version)

json2csv() is provided by the nltk.twitter module; it converts the .json file straight into a .csv file that can then be read:

# Use json2csv to extract Trump's tweets (text field)
import pandas as pd
from nltk.corpus import twitter_samples
from nltk.twitter.common import json2csv

input_file = twitter_samples.abspath('/Users/youngmihuang/twitter-files/tweets.20180726-155316.json')
with open(input_file) as fp:
    json2csv(fp, 'tweets_text.csv', ['text'])

# Read
data = pd.read_csv('tweets_text.csv')
for line in data.text:
    print(line)

In addition, this can be combined with NLTK's tokenizer, tokenized():

# Tokenize
from nltk.corpus import twitter_samples
tokenized = twitter_samples.tokenized(input_file)  # Trump's tweets
for tok in tokenized[:5]:
    print(tok)

# Result (first 3 only)
['European', 'Union', 'representatives',..]
['Great', 'to', 'be', 'back',..]
['Thank', 'you', 'Georgia',..]

NLTK can also extract features that accompany the text field, such as hashtags, user_mentions, media_url (image URLs), and url (link targets). All of these are supported by json2csv_entities(), which is very convenient.
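To see what those entities look like underneath, the sketch below pulls them out of a tweet dict by hand. It assumes the standard tweet layout in which entities holds lists named hashtags, user_mentions, urls, and media; `extract_entities` is a hypothetical helper, and the sample tweet is hand-written and trimmed to the fields used here. It is meant to show the structure, not to replace json2csv_entities().

```python
def extract_entities(tweet):
    """Collect hashtags, mentions, links, and image URLs from a tweet dict."""
    entities = tweet.get("entities", {})
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "media_url": [m["media_url"] for m in entities.get("media", [])],
    }

# Hand-written sample, trimmed to the fields used above
tweet = {
    "text": "Enjoyed talking #AI in #Ag with @TimHammerich",
    "entities": {
        "hashtags": [{"text": "AI"}, {"text": "Ag"}],
        "user_mentions": [{"screen_name": "TimHammerich"}],
        "urls": [],
    },
}

found = extract_entities(tweet)
print(found["hashtags"])       # ['AI', 'Ag']
print(found["user_mentions"])  # ['TimHammerich']
```

Tweets without images have no media list at all, which is why every lookup falls back to an empty list.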