Mozilla already has a huge amount of information submitted by volunteers that it could use to train its own subject-specific LLM.
And as we saw from Meta's CM3Leon paper (no, I will not pronounce it "Chameleon"), which was nearly devoid of ethical consideration, you don't need a huge dataset to train on if you supplement it with your own preconfigured biases. For better or worse.
Just because something is "AI-powered" doesn't mean the training datasets have to be acquired unethically. Even if there is something to be said for making material public and the inevitable ways it can then be used.
I hope whoever gets the job can help pave the way for ethics standards in AI research.