Should we block OpenAI from scraping the server?

m-p{3}@lemmy.ca · 1 year ago

Should we block OpenAI from scraping the server?

Sunshine @lemmy.ca · 7 days ago

Yes, please prevent them from using our conversations.

ono@lemmy.ca · 1 year ago

Yes, please.

We can’t stop LLM developers from scraping our conversations if they’re determined to do so, but we can at least make our wishes clear. If they respect our wishes, then great. If they don’t, then they’ll be unable to plead ignorance, and our signpost in the road (along with those from other instances) might influence legislation as it’s drafted in the coming years.

Shadow@lemmy.ca · 1 year ago

I’m on board for this, but I feel obliged to point out that it’s basically symbolic and won’t mean anything. Since all the data is federated out, they have a plethora of places to harvest it from - or, more likely, they just run their own ActivityPub harvester.

I’ve thrown a block into nginx so I don’t need to muck with robots.txt inside the lemmy-ui container.

# curl -H 'User-agent: GPTBot' https://lemmy.ca/ -i
HTTP/2 403

skankhunt42@lemmy.ca · 1 year ago

I imagine they rate limit their requests too so I doubt you’ll notice any difference in resource usage. OVH is Unmetered* so bandwidth isn’t really a concern either.

I don’t think it will hurt anything but adding it is kind of pointless for the reasons you said.

nbailey@lemmy.ca · 1 year ago

Yes. Ban them.

if ($http_user_agent = "GPTBot") {
  return 403;
}

jman269@lemmy.world · 1 year ago

Probably want == instead, or else we will all be forbidden

Shadow@lemmy.ca · edit-2 1 year ago

I would have thought so too, but == failed the syntax check

2023/08/07 15:36:59 [emerg] 2315181#2315181: unexpected "==" in condition in /etc/nginx/sites-enabled/lemmy.ca.conf:50

You actually want ~ though, because GPTBot is only one part of the user agent; it’s not the full string.
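
A minimal sketch of the ~ form, for anyone following along. It assumes the block goes inside the existing server { } section of the site config; that placement is an assumption, not something confirmed in this thread.

```nginx
# Sketch only: placement inside the site's server { } block is assumed.
# "~" is a case-sensitive regex match, so this fires whenever "GPTBot"
# appears anywhere in the User-Agent header, not only when the header
# is exactly the string "GPTBot".
if ($http_user_agent ~ "GPTBot") {
    return 403;
}
```

Using ~* instead of ~ would make the match case-insensitive, at the cost of also catching any other agent string that happens to contain those letters.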

nbailey@lemmy.ca · 1 year ago

Strangely, in nginx a single = is the comparison operator, doing what == does in most languages. It’s a very strange config format…

https://nginx.org/en/docs/http/ngx_http_rewrite_module.html#if

quesomodo@programming.dev · 1 year ago

Look at me! I’m the GPTBot now!

Shadow@lemmy.ca · 1 year ago

Thanks for empowering my laziness =)

Lucidlethargy@sh.itjust.works · 1 year ago

1000% yes. Please block them.

sndmn@lemmy.ca · 1 year ago

Is this even possible without all federated instances also prohibiting them?

m-p{3}@lemmy.ca · 1 year ago

You take action where you can ;)

Alligatorade@lemmy.ca · 1 year ago

narF@lemmy.ca · 1 year ago

Are they even respecting those files?

But yeah, sure, it’s worth trying!

m-p{3}@lemmy.ca · edit-2 1 year ago

It’s from the official documentation.

EhForumUser@lemmy.ca · 1 year ago

Worth trying for what reason?

Elise@beehaw.org · 1 year ago

Just out of curiosity, why is everyone so up in arms about this? I mean sure it’s just another corp but any other reasons?

corsicanguppy@lemmy.ca · 1 year ago

Server load spent on a bot scraping our contributions to be used to make money.

There’s so much there that it’s gonna offend someone.

Elise@beehaw.org · 1 year ago

Wouldn’t it just be scraped once (per company)? That doesn’t sound like such a problem.

Alligatorade@lemmy.ca · 1 year ago

deleted by creator

EhForumUser@lemmy.ca · edit-2 1 year ago

No, definitely not. Our work posted in the open is done so because we want it to be open!

It is understandable that not all work wants to be open, but in those cases access would already be appropriately locked down for all robots (and humans!) who are not members of the secret club. There is no need for special treatment here.
