PTT Crawler

江玠廷

Data Engineer
Taipei City, Taiwan
https://github.com/TimJJTing/ptt_crawler

PTT Crawler is a Scrapy project for crawling PTT articles, and it is ready for deployment on Scrapinghub. What sets it apart from similar projects is that it accounts for complex PTT article patterns, such as signature files (簽名檔) and edited articles, so article content is still extracted correctly in those cases. It also applies algorithms to infer important values that are missing from the source, such as the exact date and time of comments, which improves the quality of the resulting data.
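
To illustrate what inferring a missing comment date could look like, here is a minimal sketch in Python. It is not taken from the project's code. It assumes that each comment carries only a `MM/DD HH:MM` stamp with no year, that comments are listed in chronological order, and that no comment is older than the article itself. The function name `infer_comment_datetimes` and its parameters are hypothetical.

```python
from datetime import datetime


def infer_comment_datetimes(article_time, comment_stamps):
    """Attach a year to 'MM/DD HH:MM' comment stamps (illustrative only).

    Assumptions (not confirmed by the project): comments are in
    chronological order and none predates the article, so the year is
    advanced only when a stamp would otherwise move backwards in time.
    """
    results = []
    # Comment stamps have minute resolution, so compare at minute resolution.
    previous = article_time.replace(second=0, microsecond=0)
    year = previous.year
    for stamp in comment_stamps:
        month_day, clock = stamp.split()
        month, day = (int(part) for part in month_day.split("/"))
        hour, minute = (int(part) for part in clock.split(":"))
        # Try the current year first, then later years until the stamp
        # no longer precedes the previous timestamp.
        for candidate_year in range(year, year + 9):
            try:
                candidate = datetime(candidate_year, month, day, hour, minute)
            except ValueError:  # e.g. 02/29 in a non-leap year
                continue
            if candidate >= previous:
                break
        else:
            raise ValueError(f"cannot place comment stamp {stamp!r}")
        year = candidate.year
        results.append(candidate)
        previous = candidate
    return results


if __name__ == "__main__":
    posted = datetime(2018, 12, 31, 23, 50, 12)
    stamps = ["12/31 23:58", "01/01 00:03"]
    for moment in infer_comment_datetimes(posted, stamps):
        print(moment.isoformat())
```

The sketch resolves only the year. Anything finer than the minute, such as seconds, is not present in the assumed stamp format and is left unfilled.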

Published: January 13, 2019

Crawler
python
