PTT Crawler

江玠廷


Data Engineer
Taipei City, Taiwan
https://github.com/TimJJTing/ptt_crawler

PTT Crawler is a Scrapy project for crawling PTT articles, and it is ready for deployment on Scrapinghub. What sets it apart from similar projects is that it accounts for complex article patterns, such as signature files (簽名檔) and edited articles, and still extracts the article content correctly. It also applies algorithms to recover important values that are missing from the source (e.g. the exact date and time of comments), so the resulting data is of higher quality.
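To illustrate the kind of recovery involved for comment dates, here is a minimal sketch. It is not taken from the repository. It assumes that comment stamps arrive as "MM/DD HH:MM" strings with no year, that comments are listed in chronological order, and that none predates the article. The function name infer_comment_datetimes and its parameters are hypothetical.

```python
from datetime import datetime


def infer_comment_datetimes(posted_at, raw_stamps):
    """Attach a year to 'MM/DD HH:MM' comment stamps (hypothetical helper).

    Assumption: comments are chronological and none is older than the article,
    so a stamp that appears to go backwards must belong to a later year.
    """
    resolved = []
    # Compare at minute precision, since the stamps carry no seconds.
    previous = posted_at.replace(second=0, microsecond=0)
    for stamp in raw_stamps:
        month_day, clock = stamp.split()
        month, day = (int(part) for part in month_day.split("/"))
        hour, minute = (int(part) for part in clock.split(":"))
        candidate = None
        # Try the current year first, then later years until the order holds.
        for offset in range(9):
            try:
                attempt = datetime(previous.year + offset, month, day, hour, minute)
            except ValueError:
                # For example 02/29 in a non-leap year: try the next year.
                continue
            if attempt >= previous:
                candidate = attempt
                break
        if candidate is None:
            raise ValueError(f"cannot place comment stamp {stamp!r}")
        resolved.append(candidate)
        previous = candidate
    return resolved


posted = datetime(2018, 12, 31, 23, 50, 12)
comments = infer_comment_datetimes(posted, ["12/31 23:55", "01/01 00:03", "01/02 10:00"])
for moment in comments:
    print(moment.isoformat())
```

Under these assumptions the second and third comments roll over into 2019. Comments stored out of order, or gaps longer than a year between consecutive comments, would be misdated by this sketch.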

Published: Jan 13th 2019

Crawler
python
