Neuronpedia - AI Safety Game — LessWrong

seasonone@opidea.xyz · 1 year ago

TechLich@lemmy.world · 1 year ago

I think the idea is that LLMs may have alignment issues because it’s not clear which concepts map to which activations. That makes it difficult to see what they’re really “thinking” about when they generate text, e.g. whether they’re being misleading or incorrectly associating concepts that shouldn’t be connected.

The idea here is to use mechanistic interpretability techniques to see which text activates which neurons in an LLM, crowdsource the meanings behind those activations, and then see if that’s something you could use to look up context about what an AI is doing. Sort of trying to make a “Wikipedia of AI mind reading”.
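To make that a bit more concrete, here is a rough toy sketch of the "which text activates which neuron, then crowdsource a label" loop. Everything in it is made up for illustration: the embeddings, the two "neurons", and the helper names `activation`, `top_tokens` and `consensus_label` are all hypothetical, and none of it is Neuronpedia's actual code or data. A real setup would read activations out of an actual model rather than a hand-written table.

```python
from collections import Counter

# Hypothetical toy data: each token gets a made-up 3-number embedding.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "bank": [0.1, 0.9, 0.2],
    "loan": [0.0, 0.8, 0.3],
    "run": [0.2, 0.1, 0.9],
}

# Hypothetical "neurons": each is just a weight vector over the embedding.
NEURONS = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
}


def activation(weights, embedding):
    """ReLU of a dot product, standing in for one neuron's activation."""
    return max(0.0, sum(w * x for w, x in zip(weights, embedding)))


def top_tokens(neuron_id, k=2):
    """Return the k tokens that activate the given neuron most strongly."""
    weights = NEURONS[neuron_id]
    ranked = sorted(
        EMBEDDINGS,
        key=lambda tok: activation(weights, EMBEDDINGS[tok]),
        reverse=True,
    )
    return ranked[:k]


def consensus_label(votes):
    """Pick the most common crowd-sourced label for a neuron."""
    return Counter(votes).most_common(1)[0][0]


# Show people the top-activating tokens, collect their guesses, keep the winner.
print(top_tokens(0))
print(consensus_label(["animals", "pets", "animals"]))
```

The point of the sketch is just the shape of the workflow: rank inputs by how strongly they fire a neuron, show the top ones to people, and store whatever label most of them agree on so it can be looked up later.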

Dunno how practical or effective that approach is, but it’s an interesting idea.
