A Scrapy project that crawls search results from Google, Bing, and Baidu.
Adapted and rewritten from https://github.com/xtt129/seCrawler, with extraction of each result's title and abstract added.
Python 3.5 and Scrapy are required.
Run a single command to fetch 50 pages of results for a keyword from a search engine; the results are saved to “urls.txt” in the current directory.
scrapy crawl keywordSpider -a keyword=Spider-Man -a se=bing -a pages=50
scrapy crawl keywordSpider -a keyword=Spider-Man -a se=baidu -a pages=50
scrapy crawl keywordSpider -a keyword=Spider-Man -a se=google -a pages=50
The URL, title, and abstract of each result are stored in urls.txt.
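The three commands above differ only in the value of `se`, so they can be driven from a short script. The following is a minimal sketch, not part of the project: the helper names `build_crawl_command` and `crawl_all_engines` are hypothetical, and it assumes `scrapy` is on the PATH and the script is run from the project directory.

```python
import shlex
import subprocess

# The three values of `se` used in the commands above.
SEARCH_ENGINES = ("bing", "baidu", "google")


def build_crawl_command(keyword, se, pages=50):
    """Return the argument list for one `scrapy crawl keywordSpider` run."""
    return [
        "scrapy", "crawl", "keywordSpider",
        "-a", "keyword={}".format(keyword),
        "-a", "se={}".format(se),
        "-a", "pages={}".format(pages),
    ]


def crawl_all_engines(keyword, pages=50):
    """Run the crawl once per search engine, one after another (hypothetical helper)."""
    for se in SEARCH_ENGINES:
        command = build_crawl_command(keyword, se, pages)
        # Echo the equivalent shell command before running it.
        print(" ".join(shlex.quote(part) for part in command))
        # check=True raises CalledProcessError if scrapy exits with an error.
        subprocess.run(command, check=True)
```

Calling `crawl_all_engines("Spider-Man")` would run the three crawls in sequence. Whether successive runs append to or overwrite urls.txt is not stated here, so it is worth checking before relying on the combined output.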
The project doesn’t provide any workaround for anti-spider measures such as CAPTCHAs or IP bans.
To reduce the chance of triggering these measures, we recommend setting
DOWNLOAD_DELAY=10 in the settings.py file to add a delay (in seconds) between consecutive page requests; see the Scrapy settings documentation for details.
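As a sketch, the relevant fragment of settings.py could look like the following. Only `DOWNLOAD_DELAY` comes from the recommendation above; `RANDOMIZE_DOWNLOAD_DELAY` is a standard Scrapy setting shown here as an optional addition, and the rest of the file is assumed to be unchanged.

```python
# settings.py (fragment)

# Wait 10 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 10

# Standard Scrapy setting, enabled by default (an addition to the
# recommendation above): each wait is drawn at random between
# 0.5x and 1.5x of DOWNLOAD_DELAY so requests are less regular.
RANDOMIZE_DOWNLOAD_DELAY = True
```

A longer delay makes a 50-page crawl slower, but it does not guarantee that a search engine will not block the crawler.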