Recently, I worked on the problem of classifying products based on their one-sentence descriptions. My goal was to detect collection or series names (e.g., for Adidas shoes: Copa, Superstar, etc.). A bit of research turned up several existing solutions. Unfortunately, while they handled product names quite well, they performed much worse on collection names. In today’s post, I want to share my concept for solving this problem without machine learning.
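The post describes the concept in detail; purely as a flavor of what a rule-based (non-ML) approach can look like, here is a minimal sketch that matches descriptions against a hand-curated dictionary of collection names. The dictionary, function name, and example input are my illustrative assumptions, not necessarily the post’s actual method.

```python
# Illustrative sketch only: dictionary-based collection detection.
# The brand/collection dictionary is a made-up placeholder.
from typing import Optional

COLLECTIONS = {
    "adidas": ["copa", "superstar", "gazelle", "samba"],
}

def detect_collection(description: str) -> Optional[str]:
    """Return the first known collection name found in a product description."""
    tokens = description.lower().split()
    for brand, collections in COLLECTIONS.items():
        if brand in tokens:
            for name in collections:
                if name in tokens:
                    return name
    return None

print(detect_collection("Adidas Copa Mundial firm ground boots"))  # -> "copa"
```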
Scrapy is the best and most flexible web scraping tool I’ve found so far. The speed at which scripts are created depends on the structure of the site being analyzed, the bot protection used, and the amount of data downloaded. In standard cases, creating and deploying a web scraper can take literally 15 minutes. This entry is a short tutorial on the tool. I will show how to create a simple web scraper (using a well-known advertisement website as an example) and how to use the Scrapinghub platform to deploy the script so that it runs on a schedule.
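To give a taste before the tutorial proper, a minimal Scrapy spider has roughly this shape. The URL, CSS selectors, and field names below are placeholder assumptions, not the actual advertisement site covered in the post.

```python
import scrapy

class AdsSpider(scrapy.Spider):
    """Minimal spider skeleton; URL and selectors are placeholders."""
    name = "ads"
    start_urls = ["https://example.com/listings"]  # placeholder URL

    def parse(self, response):
        # Yield one item per listing; the selectors depend on the real page.
        for ad in response.css("div.listing"):
            yield {
                "title": ad.css("h2::text").get(),
                "price": ad.css("span.price::text").get(),
            }
        # Follow pagination if a "next" link exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

A spider like this can be run locally with `scrapy runspider ads_spider.py -o ads.json` before being deployed to Scrapinghub.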
In a series of entries on work automation, I would like to focus on examples of improving recurring tasks. I will try to present solutions to the same problems using various tools (R, Python, VBA, etc.). Today’s entry concerns the automation of a simple process using R. If you often deal with cyclical tasks such as download data > summarise data > prepare Word document > send an email, and do not yet do this in an automated way, then this entry is for you!
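The post itself walks through this pipeline in R; just to show its overall shape, here is a sketch in Python (kept in the same language as the other examples in this section). Every URL, column name, address, and server below is a placeholder assumption.

```python
# Shape of the download -> summarise -> report -> email pipeline.
# The post implements this in R; this Python version is only an
# illustration, and all names here are placeholders.
import smtplib
from email.message import EmailMessage

import pandas as pd

# 1. Download data (placeholder URL).
df = pd.read_csv("https://example.com/data.csv")

# 2. Summarise it (placeholder column names).
summary = df.groupby("category")["value"].sum()

# 3. Prepare a document (plain text here; the post generates a Word file).
report = f"Daily summary:\n\n{summary.to_string()}\n"

# 4. Send it by email (placeholder server and addresses).
msg = EmailMessage()
msg["Subject"] = "Daily report"
msg["From"] = "me@example.com"
msg["To"] = "boss@example.com"
msg.set_content(report)

with smtplib.SMTP("smtp.example.com") as server:
    server.send_message(msg)
```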
I’ve been doing web scraping for over three years. For this purpose, I use bash, VBA, Google Sheets, R, and Python. Recently, at the WhyR? 2017 and DATA SCIENCE? AGHree! 2018 conferences, I had the pleasure of leading web scraping workshops in R. While preparing the workshops, I came across some interesting protections against automated data downloading. In this series of entries on web scraping, I would like to share some of these problems and ideas for solving them.