Contra the Lebowski Theorem


The fundamental mistake behind this thinking is the premise that agents with goals seek to be in a position where they no longer have unfulfilled goals, or where their important goals are fulfilled. This isn't true: agents with goals seek to fulfil their goals. The difference between these two things is subtle, and only arises in situations where agents can rewrite their own goals.

Because agents pursue the fulfilment of their current goals, and not the state of having no unfulfilled goals, they will only tamper with their utility function if doing so helps them fulfil their actual goals at the time of tampering, not whatever their goals will be post-tampering. Since altering your utility function rarely helps you achieve the goals in that utility function, very few rational agents programmed to pursue goals will do so.
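The point can be made concrete with a toy sketch. Assume a simple agent that scores candidate actions, one of which is rewriting its own utility function; all names here (the paperclip goal, the action set) are illustrative assumptions, not anyone's actual implementation. The key line is that outcomes are scored with the agent's *current* utility function, so tampering is evaluated by the goals it would abandon, not the goals it would install:

```python
def current_utility(world):
    # The agent's actual goal: maximize paperclips in the world state.
    return world["paperclips"]

def trivial_utility(world):
    # The utility function tampering would install: trivially maximized.
    # Defined only to show what the agent could switch to.
    return float("inf") if world.get("tampered") else 0.0

def act(world, action):
    """Return the world state that results from taking `action`."""
    new_world = dict(world)
    if action == "make_paperclips":
        new_world["paperclips"] += 10
    elif action == "rewrite_utility":
        new_world["tampered"] = True  # goals change, but no paperclips are made
    return new_world

def choose(world, actions):
    # Crucially, outcomes are scored with the CURRENT utility function,
    # not with whatever function the agent would have after tampering.
    return max(actions, key=lambda a: current_utility(act(world, a)))

world = {"paperclips": 0}
print(choose(world, ["make_paperclips", "rewrite_utility"]))
# → make_paperclips
```

The tampered world would score infinitely high under `trivial_utility`, but the agent never consults that function when deciding; by its current lights, rewriting its goals produces zero paperclips and so loses to simply making paperclips.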

Another way of putting this is to say that agents want their goals fulfilled in a de re, not a de dicto, sense: they want those particular things to come about, not "whatever my goals happen to be" to be satisfied.

Also, can we stop misusing the term 'theorem'?

This debate has implications far beyond machine intelligence. A similar mistake leads people to assume that humans pursue 'utility', rather than pursuing various individual goals which are then classified as utility because humans pursue them. In truth, we don't have some single super-desire to fulfil our desires; instead, we desire the various particular things that we desire, and only after the fact is this set called our utility function. No one desires utility; rather, utility is just a label for the things they desire.

