Hacker News | maerten's comments

Nice article!

> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.

I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
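A minimal sketch of that split, assuming the raw payloads are JSON files on local disk. The function names and the `title`/`price` fields are hypothetical and only serve as illustration; a real site will have its own schema.

```python
import json
import tempfile
from pathlib import Path


def save_raw(raw_dir: Path, name: str, body: str) -> Path:
    """Scrape step: store the response body untouched, no parsing here."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{name}.json"
    path.write_text(body, encoding="utf-8")
    return path


def parse_product(body: str) -> dict:
    """Parse step: can be fixed and re-run at any time over the saved files."""
    data = json.loads(body)
    # Hypothetical field names, chosen for this example only.
    return {
        "title": data["title"],
        "price_cents": round(float(data["price"]) * 100),
    }


def reparse_all(raw_dir: Path) -> list[dict]:
    """Re-run the current parser over every stored payload, in file order."""
    files = sorted(raw_dir.glob("*.json"))
    return [parse_product(p.read_text(encoding="utf-8")) for p in files]


# Pretend these two bodies came back from the scraper.
raw_dir = Path(tempfile.mkdtemp()) / "raw"
save_raw(raw_dir, "0001", '{"title": "Melk 1L", "price": "1.19"}')
save_raw(raw_dir, "0002", '{"title": "Brood", "price": "2.05"}')
products = reparse_all(raw_dir)
```

If the site changes its format, only `parse_product` needs a fix, and `reparse_all` can be run again over everything that was already downloaded.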

I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/

Most of the scraping in my project is done with simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is Next.js.

The main challenge I have been working on is linking products from different supermarkets, so that you can list their prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...

It works for the most part, as long as at least one correct barcode number is provided for a product.
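One way barcode-based linking could work, as a rough sketch and not necessarily how the site does it: treat two products as the same item when they share at least one barcode, and merge them with a small union-find. The product ids, the `barcodes` field and the barcode values are all made up for this example.

```python
from collections import defaultdict


def link_by_barcode(products: list[dict]) -> list[list[str]]:
    """Group product ids that share at least one barcode (union-find)."""
    parent = {p["id"]: p["id"] for p in products}

    def find(x: str) -> str:
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # barcode -> first product id that carried it
    for p in products:
        for code in p["barcodes"]:
            if code in first_seen:
                # Same barcode seen before: merge the two groups.
                parent[find(p["id"])] = find(first_seen[code])
            else:
                first_seen[code] = p["id"]

    groups = defaultdict(list)
    for p in products:
        groups[find(p["id"])].append(p["id"])
    return sorted(sorted(g) for g in groups.values())


# Hypothetical listings from three shops; barcodes are placeholders.
products = [
    {"id": "shop-a-1", "barcodes": ["111"]},
    {"id": "shop-b-7", "barcodes": ["111", "222"]},
    {"id": "shop-c-3", "barcodes": ["222"]},
    {"id": "shop-a-2", "barcodes": []},
]
linked = link_by_barcode(products)
```

A product without any barcode (like `shop-a-2` here) stays in a group of its own, which matches the limitation described above.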


Thanks!

> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.

Yes, that's exactly what I've been doing and it saved me more times than I'd care to admit!


Awesome, have been looking for something like this!


I just downloaded the app on iPhone, and the dive planner allows you to change the gas mix (oxygen percentage), so it looks like it supports nitrox.


@dockimbel: perhaps adding some syntax coloring to your code example might help here; the syntax is hard to parse if you don't know what to look for.


Down indeed


We're using CasperJS (basically a simple interface to PhantomJS), works great for us. Have a look here: http://casperjs.org/


Pretty cool, although it keeps crashing my Safari on the OS X Yosemite beta somehow :)


I wasn't a fan of glossy displays either, but the screen on the rMBP seems a lot less reflective than the previous generation's MacBook Pros (the ones with a built-in DVD drive). The reflections are much darker, and kind of hard to notice.


Nice, although tmux -S /tmp/pairprog and tmux -S /tmp/pairprog attach isn't that hard to type :-)


In this example, you are both connected to the same session. Each client can't focus on a separate window. It looks something like this:

         +-------------+
         | tmux server |
         +-------------+
            /
       +---------+
       | session |
       +---------+
        /       \ 
  +---------+ +---------+
  | client0 | | client1 |
  +---------+ +---------+

- `tmux server` is the 'backend' process listening on /tmp/pairprog.

- `client0` and `client1` are the 'frontend' processes connecting to the socket /tmp/pairprog.

- `session` is the collection of windows that you are using.

The issue is that the currently active/focused window is an attribute of the session, meaning that all connected clients are always focused on the same window. What if you want each client to focus on a separate window? This is the behaviour you get by default in screen, but it takes a little more work in tmux.

The script is just a simple way to use session grouping, which would look like this:

         +-------------+
         | tmux server |
         +-------------+
            /        \
       +---------+ +-----------+
       | session |=| session-1 |
       +---------+ +-----------+
            |           |
       +---------+ +---------+
       | client0 | | client1 |
       +---------+ +---------+

Note the '=' between the sessions. I'm using this to denote that `session` and `session-1` are grouped, meaning that they share the same windows. Since each client is connected to a different session, they can switch windows independently.
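The script itself is not reproduced here, so the following is only a sketch of the grouping idea. It builds the two tmux command lines for the diagram, assuming the socket /tmp/pairprog and the session names `session` and `session-1`. The helper names are made up; the grouping relies on tmux's `new-session -t`, which creates a new session in the same group as the target.

```python
import shlex


def first_client_cmd(socket: str, session: str) -> list[str]:
    """Start a server on the socket and create the base session."""
    return ["tmux", "-S", socket, "new-session", "-s", session]


def grouped_client_cmd(socket: str, session: str, grouped: str) -> list[str]:
    """Create a second session grouped with `session` (shares its windows)."""
    return ["tmux", "-S", socket, "new-session", "-t", session, "-s", grouped]


cmd0 = first_client_cmd("/tmp/pairprog", "session")
cmd1 = grouped_client_cmd("/tmp/pairprog", "session", "session-1")

# Print the commands as they would be typed in a shell.
print(shlex.join(cmd0))
print(shlex.join(cmd1))
# To run one from a real terminal instead: subprocess.run(cmd1, check=True)
```

The first person runs the first command, the second person runs the second, and each can then switch windows without dragging the other along.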


It's one reason I hadn't used tmux before: I couldn't figure out a nice way to get this. I love to use screen this way on multi-monitor setups, and tmux's other windowing features would make this awesome.

