> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.
I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
Most of the scraping in my project is done with simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built in Haskell and run on AWS ECS. The website is built with Next.js.
> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
Yes, that's exactly what I've been doing and it saved me more times than I'd care to admit!
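For anyone who wants to see the shape of that split, here is a minimal sketch in Python. It is not the code from either project: the file layout, the function names (`save_raw`, `parse_product`, `reparse_all`) and the payload fields (`name`, `price`) are all made up for illustration. The point is only that the fetch step stores the response untouched, so a fixed parser can be re-run over old data later.

```python
import hashlib
import json
import time
from pathlib import Path


def save_raw(raw_dir, url, body, fetched_at=None):
    """Store the untouched response body next to a small metadata file."""
    fetched_at = time.time() if fetched_at is None else fetched_at
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a short hash of URL plus fetch time.
    name = hashlib.sha256(f"{url}|{fetched_at}".encode()).hexdigest()[:16]
    body_path = raw_dir / f"{name}.body"
    body_path.write_bytes(body)
    meta = {"url": url, "fetched_at": fetched_at}
    (raw_dir / f"{name}.meta.json").write_text(json.dumps(meta))
    return body_path


def parse_product(body):
    """Hypothetical parser; the payload shape is an assumption."""
    data = json.loads(body)
    return {
        "name": data["name"],
        "price_cents": round(float(data["price"]) * 100),
    }


def reparse_all(raw_dir):
    """Re-run the (possibly fixed) parser over every stored payload."""
    paths = sorted(Path(raw_dir).glob("*.body"))
    return [parse_product(p.read_bytes()) for p in paths]
```

When a site changes its format silently, you fix `parse_product` and call `reparse_all` again; nothing has to be re-fetched.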
I wasn't a fan of glossy displays either, but the screen on the rMBP seems a lot less reflective than the previous generations' MacBook Pros (the ones with a built-in DVD drive). The reflections are much darker and kind of hard to notice.
- `tmux server` is the 'backend' process listening on /tmp/pairprog
- `client0` and `client1` are the 'frontend' processes connecting to the socket /tmp/pairprog.
- `session` is the collection of windows that you are using.
The issue is that the currently active/focused window is an attribute of the session, meaning that all connected clients are always focused on the same window. What if you want each client to focus on a separate window? This is the behaviour you get by default in screen, but it takes a little more work in tmux.
The script is just a simple way to use session grouping, which would look like this:
Note the '=' between the sessions. I'm using this to denote that `session` and `session-1` are grouped, meaning that they share the same windows. Since each client is connected to a different session, they can switch windows independently.
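Roughly, creating the grouped session comes down to one tmux invocation. The sketch below is not the script itself, just an illustration in Python of the command it would need to run. It relies on `new-session -t`, which puts the new session in the same group as the target, and on `-S` to select the socket; the helper names are hypothetical, and the socket path and session names are the ones used above.

```python
import subprocess


def grouped_session_argv(socket_path, target_session, new_session):
    """Build the tmux argv for a new session grouped with target_session."""
    # Shell equivalent:
    #   tmux -S <socket_path> new-session -t <target_session> -s <new_session>
    return [
        "tmux", "-S", socket_path,
        "new-session",
        "-t", target_session,  # join the target's session group (shared windows)
        "-s", new_session,     # name of the new, independently focused session
    ]


def attach_independently(socket_path, target_session, new_session):
    """Run tmux in the foreground; needs a terminal and a running server."""
    return subprocess.call(
        grouped_session_argv(socket_path, target_session, new_session)
    )
```

The second client would call something like `attach_independently("/tmp/pairprog", "session", "session-1")` instead of attaching to `session` directly.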
It's one reason I hadn't used tmux before: I couldn't figure out a nice way to get this. I love to use screen this way on multi-monitor setups, and tmux's other windowing features would make this awesome.
> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.
I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/
Most of the scraping in my project is done with simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built in Haskell and run on AWS ECS. The website is built with Next.js.
The main challenge I have been working on is linking products from different supermarkets, so that you can list their prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...
It works for the most part, as long as at least one correct barcode number is provided for a product.
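To illustrate the barcode idea (this is a sketch of one possible approach in Python, not the Haskell code the site actually runs), you can treat every product as a node and merge any two products that share a barcode, using a small union-find. The record shape `(store, product_id, barcodes)` and the function name `link_by_barcode` are assumptions made for the example.

```python
from collections import defaultdict


def link_by_barcode(products):
    """Group (store, product_id, barcodes) records that share any barcode."""
    parent = {}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # barcode -> first product key that carried it
    for store, product_id, barcodes in products:
        key = (store, product_id)
        parent.setdefault(key, key)
        for code in barcodes:
            if code in first_seen:
                union(first_seen[code], key)
            else:
                first_seen[code] = key

    groups = defaultdict(list)
    for key in parent:
        groups[find(key)].append(key)
    return sorted(sorted(group) for group in groups.values())
```

A product with no barcode at all ends up in a group by itself, and a single wrong barcode can merge two unrelated products, which matches the "for the most part" caveat.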