• digdilem@lemmy.ml
    2 months ago

    robots.txt does not work. I don’t think it ever has - it’s an honour system with no penalty for ignoring it.
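
To make the "honour system" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` name, and the `allowed` helper are made up for illustration. The parser only reports what the file asks for; whether the crawler then makes the request is entirely up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return what robots.txt *asks* of this user agent for this URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler runs a check like this and backs off on False.
    # Nothing on the server side stops a crawler that skips the check.
    print(allowed("ExampleBot", "https://example.org/page"))
```

The check lives in the client, which is why a bot that never calls it sees no difference at all.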

    I have a few low-traffic sites hosted at home, and when a crawler takes an interest it can completely flood my connection. I’m using Cloudflare and filtering incredibly aggressively, but so many bots ignore robots.txt and lie about who they are with human-looking UAs that it’s having a real impact on my ability to serve the sites to humans.

    Over the past year it’s got around ten times worse. I woke up this morning to find my connection at a crawl and, on checking the logs, found that AmazonBot had been hitting one site 12000 times an hour, and that’s one of the more well-behaved bots. There are thousands and thousands of them.
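
For anyone wanting to do the same kind of log check, a rough sketch follows. It assumes the "combined" access-log layout that nginx and Apache commonly write; the function names, the sample user agent, and the threshold are hypothetical, so adjust the regex to your own log format.

```python
import re
from collections import Counter

# Assumes the "combined" log layout, e.g.:
# 203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"
LINE_RE = re.compile(
    r'\[(?P<day>[^:]+):(?P<hour>\d{2}):[^\]]+\] '
    r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def hits_per_hour(lines):
    """Count requests per (day, hour, user agent); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["day"], match["hour"], match["ua"])] += 1
    return counts


def noisy_agents(lines, threshold=1000):
    """Return ((day, hour, user agent), hits) pairs at or above threshold, busiest first."""
    busy = [(key, n) for key, n in hits_per_hour(lines).items() if n >= threshold]
    return sorted(busy, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "access.log" is a placeholder path.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for (day, hour, ua), n in noisy_agents(log):
            print(f"{day} {hour}:00  {n:>7}  {ua}")
```

This only groups by the user agent the client claims, so bots that fake a browser UA will hide among real visitors; grouping by IP or network range would be the next step for those.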