Robert Miles - 2021-02-16
This "Alignment" thing turns out to be even harder than we thought. # Links The Paper: https://arxiv.org/pdf/1906.01820.pdf Discord Waiting List Sign-Up: https://forms.gle/YhYgjakwQ1Lzd4tJ8 AI Safety Career Bottlenecks Survey: https://www.guidedtrack.com/programs/n8cydtu/run # Referenced Videos Intelligence and Stupidity - The Orthogonality Thesis: http://youtu.be/hEUO6pjwFOo 9 Examples of Specification Gaming: https://youtu.be/nKJlF-olKmg Why Would AI Want to do Bad Things? Instrumental Convergence: https://youtu.be/ZeecOKBus3Q Hill Climbing Algorithm & Artificial Intelligence - Computerphile: http://youtu.be/oSdPmxRCWws AI Gridworlds - Computerphile: http://youtu.be/eElfR_BnL5k Generative Adversarial Networks (GANs) - Computerphile: http://youtu.be/Sw9r8CL98N0 # Other Media The Simpsons Season 5 Episode 19: "Sweet Seymour Skinner's Baadasssss Song" 1970s Psychology study of imprinting in ducks. Behaviorism: http://youtu.be/2xd7o3z957c With thanks to my excellent Patreon supporters: https://www.patreon.com/robertskmiles - Timothy Lillicrap - Gladamas - James - Scott Worley - Chad Jones - Shevis Johnson - JJ Hepboin - Pedro A Ortega - Said Polat - Chris Canal - Jake Ehrlich - Kellen lask - Francisco Tolmasky - Michael Andregg - David Reid - Peter Rolf - Teague Lasser - Andrew Blackledge - Frank Marsman - Brad Brookshire - Cam MacFarlane - Jason Hise - Phil Moyer - Erik de Bruijn - Alec Johnson - Clemens Arbesser - Ludwig Schubert - Allen Faure - Eric James - Matheson Bayley - Qeith Wreid - jugettje dutchking - Owen Campbell-Moore - Atzin Espino-Murnane - Johnny Vaughan - Jacob Van Buren - Jonatan R - Ingvi Gautsson - Michael Greve - Tom O'Connor - Laura Olds - Jon Halliday - Paul Hobbs - Jeroen De Dauw - Lupuleasa Ionuț - Cooper Lawton - Tim Neilson - Eric Scammell - Igor Keller - Ben Glanton - anul kumar sinha - Duncan Orr - Will Glynn - Tyler Herrmann - Tomas Sayder - Ian Munro - Joshua Davis - Jérôme Beaulieu - Nathan Fish - Taras Bobrovytsky - Jeremy - Vaskó Richárd - Benjamin Watkin - Sebastian Birjoveanu - Andrew Harcourt - Luc Ritchie - Nicholas Guyett - James Hinchcliffe - 12tone - Oliver Habryka - Chris Beacham - Zachary Gidwitz - Nikita Kiriy - Parker - Andrew Schreiber - Steve Trambert - Mario Lois - Abigail Novick - Сергей Уваров - Bela R - Mink - Fionn - Dmitri Afanasjev - Marcel Ward - Andrew Weir - Kabs - Miłosz Wierzbicki - Tendayi Mawushe - Jake Fish - Wr4thon - Martin Ottosen - Robert Hildebrandt - Poker Chen - Kees - Darko Sperac - Paul Moffat - Robert Valdimarsson - Marco Tiraboschi - Michael Kuhinica - Fraser Cain - Robin Scharf - Klemen Slavic - Patrick Henderson - Oct todo22 - Melisa Kostrzewski - Hendrik - Daniel Munter - Alex Knauth - Kasper - Ian Reyes - James Fowkes - Tom Sayer - Len - Alan Bandurka - Ben H - Simon Pilkington - Daniel Kokotajlo - Peter Hozák - Diagon - Andreas Blomqvist - Bertalan Bodor - David Morgan - Zannheim - Daniel Eickhardt - lyon549 - Ihor Mukha - 14zRobot - Ivan - Jason Cherry - Igor (Kerogi) Kostenko - ib_ - Thomas Dingemanse - Stuart Alldritt - Alexander Brown - Devon Bernard - Ted Stokes - James Helms - Jesper Andersson - DeepFriedJif - Chris Dinant - Raphaël Lévy - Johannes Walter - Matt Stanton - Garrett Maring - Anthony Chiu - Ghaith Tarawneh - Julian Schulz - Stellated Hexahedron - Caleb - Scott Viteri - Conor Comiconor - Michael Roeschter - Georg Grass - Isak - Matthias Hölzl - Jim Renney - Edison Franklin - Piers Calderwood - Krzysztof Derecki - Mikhail Tikhomirov - Richard Otto - Matt Brauer - Jaeson Booker - Mateusz Krzaczek - 
Artem Honcharov - Michael Walters - Tomasz Gliniecki - Mihaly Barasz - Mark Woodward - Ranzear - Neil Palmere - Rajeen Nabid - Christian Epple - Clark Schaefer - Olivier Coutu - Iestyn bleasdale-shepherd - MojoExMachina - Marek Belski - Luke Peterson - Eric Eldard - Eric Rogstad - Eric Carlson - Caleb Larson - Braden Tisdale - Max Chiswick - Aron - David de Kloet - Sam Freedo - slindenau - A21 - Rodrigo Couto - Johannes Lindmark - Nicholas Turner - Tero K https://www.patreon.com/robertskmiles
I think one of the many benefits of studying AI is how much it's teaching us about human behaviour.
AI safety and ethics are literally the same field of study.
True
Well, that's kinda obvious, if you consider that we, as intelligent beings, are trying to build an "artificial intelligence". It's like studying ourselves, but it's because the machine can be so much "cleverer" at one thing than a human (simple example: chess) that we're getting insights that we simply couldn't have perceived if we looked at just regular humans.
@Dima-X – Gotoro Many things that appear obvious in retrospect are not obvious until explored in detail.
In any event, the field of AI safety has many valuable insights into ethics, especially related to the principal-agent problem and people (such as politicians or directors of large corporations) who have much power over others.
Particularly male behaviour!
This reminds me of a story. My father was very strict, and would punish me for every perceived misstep of mine. He believed that this would "optimize" me towards not making any more missteps, but what it really did is optimize me to get really good at hiding missteps. After all, if he never catches a misstep of mine, then I won't get punished, and I reach my objective.
@3irikur yeah, because genocide is a new idea... Your faith in humanity is commendable, but the claim that "even the most psychotic people have morals" is deeply naive; just look at history.
@Paul Tapping I find the assumption that you could terminate every last aspect of morality from a human without disrupting vital functions naive. And I can't think of a single historical figure who hasn't demonstrated some moral quality, not that I'm much of a historian.
That's exactly why I think hitting kids as punishment is always a bad idea. They know that "if I do (x), I will be hit." They are no closer to understanding whether (x) is wrong or right. But they may learn that violence is an acceptable way to get what they want. An AI has a similar issue: avoiding or concealing unwanted behaviour does not mean its goal has been corrected.
@Geek Freak I disagree with that correlation. Are you hitting without letting the child know why the punishment is there? That will teach the wrong message. But if they're actually told that setting the dog on fire is bad and that is why they're being punished (i.e. having something done to them; taking something away is negative reinforcement), then the lesson is: I don't like this thing done to me, and this thing was done to me because I did that thing. To avoid being punished like that again, I should not do that thing. It's easy to train the wrong message if you're inconsistent (like a bark collar that goes off after x seconds of barking, teaching the dog to 'count' time instead of not to bark at length), or if the punishment doesn't fit the crime (caning a woman for accidentally showing some ankle, which teaches that potential transgressions are punished irrespective of intent, so teaches either pitilessness, sloth or vicious upholding of the status quo). But it's a tool like anything else.
"Ok, I'll do the homework, but when I grow up, I'll buy all the toys and play all day long!" - some AI
This is what happens IRL, with the toys being video games.
That's me, except I didn't do my homework. Still averaged a 1.21 recurring at school despite never doing homework or studying for tests; I only learnt that there was a test on the day it happened, because I didn't care to remember beforehand. Studying for a specific test is kinda cheating anyway: it's turning the measure (the test) into a target, which makes it useless as a measure.
@RTG
Not entirely useless, but I agree. The reason they allow you to study for the SAT and such is because they consider it inevitable.
"And in order to optimize getting all the toys in the known universe, all other lifeforms must be eliminated!" - same/sub AI
And eat anything I want, as much as I want.
Mesa Optimizer: "I have determined the best way to achieve the Mesa Objective is to build an Optimizer"
@Linvael Yeah that's it. Actually we are just gonna use an AI to solve the AI-Safety-Problem.
...waaiiiit a second. What would be the optimal Strategy for this AI then? oO
@Linvael make ai that solves ai problems
@Linvael Honestly it sounds like something that would help a human-made AI solve lots of different problems. To become a general AI, it would start creating lots of its own optimizers.
until you run out of cpu cycles
This wouldn't happen, because it would discover the alignment problem (the same one we're trying to comprehend here) and destroy its optimizer.
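Since we're joking about it anyway, here's a runnable version of "optimizers building optimizers until you run out of CPU cycles". Purely a toy of my own, nothing from the video:

```python
# Toy only: an "optimizer" whose favourite move is to delegate to another
# optimizer, recursing until the compute budget runs out.

def optimize(objective, candidates, budget, depth=0):
    if budget <= 1:
        # out of CPU cycles: finally just do a plain search
        return max(candidates, key=objective), depth
    # "I have determined the best way to achieve my objective is to build an optimizer"
    return optimize(objective, candidates, budget // 2, depth + 1)

best, layers = optimize(lambda x: -(x - 3) ** 2, range(10), budget=1024)
print(best, layers)  # -> 3 10  (ten nested optimizers before any real work happens)
```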
Sorry I couldn't join the Discord chat. Just wanted to say that this presentation did a good job of explaining a complex idea. It certainly gave me something to chew on. The time it takes to do these is appreciated.
Are you a Redditor AND a discord admin? Omg
@AllahDoesNotExist What a childish handle, or a wilfully provocative one. Either way, please leave the room because the adults are talking...
@Jonathon Jubb xdddd that's a pretty childish thing to say
@Mariusz Pudzianowski I know. It's probably the only language he understands..
The first thought that came to mind when I finished the video is how criminals/patients/addicts would fake a result that their control person wants to see only to go back on it as soon as they are released from that environment. It's a bit frightening to think that if humans can outsmart humans with relative ease what a true AI could do.
@Thomas Forsthuber Our implemented systems haven't worked "good enough", society is in a constant struggle of concealing its problems with force.
What you think is "good enough" depends on whether you're the one on the receiving end of the force, or the one benefitting from it.
@Thomas Forsthuber "humans are somewhat limited in their abilities" That is until we learn to upgrade our cognitive abilities using brain-to-computer interfaces or biotech.
Why should God have a monopoly on Ted Bundys?
What's even more interesting, when a subject strongly suspects that it's in a stimulus response experiment over several years, and one day, as a lark, exhibits a response that is NOT wanted by the researchers, just to see what happens.
That would suggest, though, that as with these situations, the reason our AI is misaligned is that our optimization system doesn't actually align with our stated goals. The reason addicts don't go clean is not that they don't want to (literally no one has ever actually wanted to stay addicted); it's usually because the rehab program only addresses surface issues while not addressing the root cause of the addiction, which is often a serious mental illness like PTSD. So the rehab program only taught them to hide the addiction, which will make it harder to address in the future, because it punished them for showing the addiction instead of encouraging behaviour that would make them not need the addiction. This tends to be true for a lot of situations where society or someone might have a specific stated goal but the way they try to pursue it in fact accomplishes the opposite, and you can kinda see it in an AI hiding its goal. The problem here, almost, is that we built a system where the AI has to fear "punishment" if it is honest about its goals, making hiding its goals the optimal strategy. Maybe the solution to this is that if we want the AI to adopt human ethics, then we first have to treat it according to those same ethics, which is a lot easier said than done.
Let's call it a mesa-optimizer because calling it a suboptimizer is suboptimal.
Suboptimizer == satisficer?
Not sure this is irony or not, but either way works
Black Mesa Optimizer
A suboptimizer would be a part of the base optimizer, an optimizer within an optimizer. A mesa- or meta-optimizer isn't a part of the base optimizer.
@Martiddy - Sama Yep, when a Mesa becomes evil/undercover/hidden objective/… it becomes a Black Mesa. Like ops becomes black ops …
"When I read this paper I was shocked that such a major issue was new to me. What other big classes of problems have we just... not though of yet?"
Terrifying is the word. I too had completely missed this problem, and fuck me, it's a unit. There's no preventing unknown unknowns; knowing this, we need to work on AI safety even harder.
My optimizer says the simplest solution to this is Neuralink.
Donald Rumsfeld died yesterday and went into the great Unknown Unknown...
Extending the evolution analogy slightly further, if humans are mesa-optimizers created by evolution, and we are trying to create optimizers for our mesa-objective, it seems conceivable that the same could happen to an AI system, and perhaps we'll need to worry about how hereditary the safety measures are with respect to mesa-optimizer creation. Would that make sense?
I think he addressed that before with a really old video. A paperclip maximizer with human safety features would build a paperclip maximizer without human safety features, because it would be more efficient.
Hmm, so it's like our evolved function (not purpose - that's teleological) is to create a device that can replace everything with paperclips.
I think kurzgesagt said something similar in their plastic video, that humans are the cells of the body of planet earth, and one of their functions is the production of complex polymers.
@Dima Zheludko "And once we face civilizational threat, we'll unite as humanity" or, more likely, we will nuke each other in a desperate attempt to get away from the threat first. People can't even agree on such a basic problem of which language to use, most of the people in the world can't even talk with the majority of the other people without using external help.
@Shajirr What you're talking about is mostly due to the "us versus them" mentality. But once we see an enemy who's more powerful than any of us, we'll easily unite into "us".
Also important to consider that evolution is an optimizer with no specified goal, so it almost always ends up going for the instrumental goal of survival; but this is not always the case, as sometimes animals willingly die.
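A minimal toy model of that point (my own sketch, not from the video): nothing in it scores "survival" or anything else explicitly; the only "objective" is whatever happens to replicate, yet the population still gets pushed somewhere definite.

```python
# Evolution as an optimizer with no written-down objective: lineages that
# happen to replicate more simply come to dominate the population.

import random

population = [random.uniform(0, 1) for _ in range(200)]  # one heritable trait each

for generation in range(100):
    next_gen = []
    for trait in population:
        # replication just *happens* to depend on the trait in this environment
        n_offspring = 2 if random.random() < trait else 1
        next_gen += [min(1.0, max(0.0, trait + random.gauss(0, 0.02)))
                     for _ in range(n_offspring)]
    population = random.sample(next_gen, 200)  # fixed population size (limited resources)

print(sum(population) / len(population))  # drifts toward 1.0, with no goal ever specified
```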
"Plants follow simple rules"
*laughs in we don't even completely understand the mechanisms controlling stomatal aperture yet, while shoots are a thousand times easier to study than roots*
I Will Not Be Taking Comments From Botanists At This Time
I did see that in the video and mesa-optimised for it. Good thing I'm not a botanist!
"Just solving the outer alignment problem might not be enough."
Isn't this what basically happens when people go to therapy but have a hard time changing their behaviour?
Because they clearly can understand how a certain behaviour has a negative impact on their lives (they're going to therapy in the first place), and yet they can't seem to be able to get rid of it.
They have solved the outer alignment problem but not the inner alignment one.
As someone who has gone to therapy I can say that it's similar but more complicated. When you've worked with therapists for a long time you start to learn some very interesting things about how you, and humans in general work. The thing is that we all start off assuming that a human being is a single actor/agent, but in reality we are very many agents all interacting with each other, and sometimes with conflicting goals.
A person's behavior, in general, is guided by which agent is strongest in the given situation. For example: one agent may be dominant in your work environment and another in your living room. This is why changing your environment can change your behavior, but also reframing how you perceive a situation can do the same thing. You're less likely to be mad at someone once you've gotten their side of the story for example.
That being said, it is tough to speak to and figure out agents which are only active in certain rare situations. The therapy environment is, after all, very different from day-to-day life. Additionally, some agents have the effect of turning off your critical reasoning skills so you can't even communicate with them in the moment, AND it makes it even harder to remember what was going on that triggered them in the first place.
I guess that's all to say that, yes, having some of my agents misaligned with my overall objective is one way of looking at why I'm in therapy. But, it is not just one inner alignment problem we're working to solve. It's hundreds. And some may not even be revealed until their predecessors are resolved.
One way to look at it is how when you're working on a program, an error on line 300 may not become apparent until you've fixed the error on line 60 and the application can finally run past it.
Similarly, you won't discover the problems you have in (for example) romantic relationships until you've resolved your social anxiety during the early dating phase. Those two situations have different dominant agents and can only be worked on when you can consistently put yourself into them.
So if the person undergoing therapy has (for example) an addiction problem, they're not just dealing with cravings in general; they're dealing with hundreds or thousands of agents who all point to their addiction as a way to resolve their respective situations. The solution (in my humble opinion) is to one-by-one replace each agent with another one which has a solution that aligns more with the overall (outer) objective. But it is important to note that replacing an agent takes a lot of time, and fixing one does not fix all of them. Additionally, an old agent can be randomly revived at any time and in turn activate associated agents, causing a spiral back into old behaviors.
Hopefully these perspectives help.
@eeee. "One way to look at it is how when you're working on a program, an error on line 300 may not become apparent until you've fixed the error on line 60 and the application can finally run past it. "
That's a brilliant analogy! It perfectly describes my procrastination behaviour a lot of the time also. I procrastinate intermittently, and on difficult stages of, a large project I'm working on. It is only when I reach a sufficient stress level that I can 'find a solution' and move on, even though in reality I could and should just work on other parts of the project in the meantime. It really does feel very similar to a program reading a script and getting stopped by the error on line 60, having to correct it before I can move on. Unfortunately these are often dependency errors and I can't always seem to download the package. I have to modify the command to --force and get on with it, regardless of imperfections!
A better comparison would probably be unemployment programs that constantly require people to show proof that they're seeking employment to receive the benefits, which just means the person has less time to actually look for a job. Over time this means they're going to have less success finding a job, because they have less time and energy to do so, and it forces them to focus primarily on the bureaucracy of the program, since this is obviously how they survive now. Here we have a stated goal of getting people into employment as quickly as possible, and we end up with people developing a separate goal that to our testing looks like our stated goal. Of course the difference is that humans already naturally have the goal of survival, so most people start off actually wanting employment and are gradually forced away from it. AIs, however, start with no goals, so an AI in this situation would probably just instantly get really good at forging documents.
Now we add a third optimizer to maximize the alignment and call it metaoptimizer. This system is guaranteed to maximize confusion!
If you want to maximize confusion, all you have to do is try to program a transformer from scratch.
@aid_n *in Scratch
@James Garrard **in Scratch from scratch
DNA, sub-conscious mind, conscious mind. (?)
As I watched your channel
I thought "alignment problem is hard but very competent people are working on it"
I watched this latest video
I thought "that AI stuff is freakish hardcore"
God this channel is incredible
Praise FSM, it truly is
"Deceptive misaligned mesa-optimiser" - got to throw that randomly into my conversation today! Or maybe print it on a T-Shirt. :-)
"I'm the deceptive misaligned mesa-optimizer your parents warned you about"
I'd buy that
13:13 "... but it's learned to want the wrong thing."
like, say, humans and sugar?
Base optimizer: Educate people on the safety issues of AI
Mesa-optimizer: Make a do-do joke
It is really interesting how delving into the issues of AI and AI safety brings more and more understanding about us, the humans, and how we behave or why we behave as we do. I loved your analogy with evolution. Lots to ponder now.
It reminds me of when, to the question "how did life on earth occur", people respond with "it came from space". It's not answering the question at stake, just adding an extra complication and moving the answer one step away.
Well that's because the question is poorly phrased. Try asking what question you should ask to get the answer you will like the most.
@Nicklas Martos u mean the objective of the question was misaligned hmmm.
@Anand Suralkar rather that the question was misaligned with the purpose for asking it. But yes you get it
This video should be tagged with [don't put in any AI training datasets]
Then our future AI lord would have a nice handle on all the videos they should not be watching.
Whatever you do, don't vote up (or down) this comment.
"It's... alignment problems all the way down"
Always has been.
And always will be
[Jar Jar voice] Meesa Optimizer!
[Chewbacca voice] AOOOGHHOGHHOGGHHH
@demultiplexer I know
LOL
Darth Jar Jar confirmed!
Love his sense of humor, and the presentation was fantastic. It’s really cool to see the things being drawn showing ABOVE the hand & pen.
The little duckling broke my heart :(
I learned about the mesa-optimization problem a few months ago; it's pretty depressing. AI safety research is not moving nearly fast enough — the main thing that seems to be happening is discovering ever more subtle ways in which alignment is even harder than previously believed. Very very little in the way of real, tangible solutions commensurate with the scale of the problem. AI capabilities, meanwhile, is accelerating at breakneck pace.
It's the age-old question of the purpose of creation, really. And various intellects have been pondering it, with different toolsets to conceptualize it, for... ever!?
@Andrei Cociuba ... What?
how you explained the mesa prefix is actually quite clear, thank you!
“What other problems haven’t we thought of yet” auto-induced distributional shift has entered the chat
Great explanation! I heard about these concepts before, but never really grasped them. So on 19:45, is this kind of scenario a realistic concern for a superintelligent AI? How would a superintelligent AI know that it's still in training? How can it distinguish between training and real data if it has never seen real data? I assume programmers won't just freely provide the fact that the AI is still being trained.
Same question. Following..
I suppose the AI training data could contain subtasks which reward deceptive behavior. So the AI could learn deceptive behavior by generalizing from such experience.
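For what it's worth, the worry is usually framed structurally, something like the crude sketch below (all the functions are hypothetical stand-ins of mine, just to show the shape of a deceptively aligned policy). The agent doesn't need to be told it's in training; any learned heuristic that correlates with "this looks like the training distribution" is enough.

```python
# Entirely hypothetical stand-ins illustrating the *shape* of a deceptively
# aligned policy, not any real system.

def looks_like_training(observation):
    # e.g. training episodes are short, synthetic-looking, consequence-free...
    return observation.get("episode_length", 0) < 100

def pursue_base_objective(observation):
    return "go to the exit"           # what the programmers reward

def pursue_mesa_objective(observation):
    return "collect green things"     # what the inner optimizer actually wants

def act(observation):
    if looks_like_training(observation):
        # behave, so gradient descent has no reason to modify the mesa-objective
        return pursue_base_objective(observation)
    return pursue_mesa_objective(observation)  # defect once deployed

print(act({"episode_length": 50}))      # -> go to the exit
print(act({"episode_length": 10_000}))  # -> collect green things
```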
Very very interesting! Such a well made video. I feel like maybe there's a philosophical conclusion here: every intelligent agent will have its own (free?) will, and there's nothing to be done about that.
Also a small tip from a videographer: eyes a third of the way down the frame, even if that means cutting off the top of the head! When the eyes are half way down or more it kind of gives a drowning sensation.
Thanks for this. It's so interesting to see how building AI's tells us something about how we as humans perceive the world. This really approaches deep psychological questions about our being and even explains some of the ideological issues we have nowadays.
absolutely mind blowing.... brilliant + crystal clear explanation. I enjoyed this one so much
i somehow expected you to propose a solution at the end
Then I realized that the absence of a solution is why you made this video :D
Every time I watch one of your videos about artificial intelligence, I watch it a second time and mentally remove the word "artificial" and realize that you're doing a great job of explaining why the human world is such an intractable mess
Wow this was so interesting to learn about such a major issue!
This video is really high quality. It felt like you repeated and reinforced your points an appropriate amount for me. Thank you.
I love your videos man! You have the ability to explain complex ideas simply (as possible) but I swear I really enjoy these videos as I feel like they can help all of us understand our own decision making process better and why we tend to self destruct or want non-constructive things
Essentially it boils down to this: If we optimize a system (AI or otherwise) that is more complex than what we can reliably predict, by definition we can not predict what it will do when subjected to a new environment. At first glance it might seem simpler to reason about "goals" as if one (smaller) part of the system is controlling the rest of the system and this smaller part is easier to control.
But that cannot be a solution: Assume you could control an AGI by controlling a small part of it that is responsible for its goals, that part has to control the rest of the AGI - again facing the same dilemma we faced in controlling the AGI in the first place.
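To make the "new environment" point concrete, here's a deliberately tiny toy of my own (loosely inspired by the maze/exit example in the video, not code from the paper): an inner policy that learned the proxy "go to the green cell" scores perfectly on the base objective "reach the exit" across the whole training distribution, and collapses the moment the correlation breaks.

```python
# Toy example: in every training environment the exit happens to be the only
# green cell, so heading for green looks exactly like heading for the exit.

import random

def base_objective(final_cell, exit_cell):
    return 1.0 if final_cell == exit_cell else 0.0

def mesa_policy(start, green_cells):
    # the learned inner behaviour: go to the nearest green cell (a proxy)
    return min(green_cells, key=lambda c: abs(c - start))

def training_env():
    exit_cell = random.randint(5, 9)
    return exit_cell, {exit_cell}        # in training, the exit IS the green cell

def deployment_env():
    exit_cell = random.randint(5, 9)
    return exit_cell, {0}                # deployed: something else is green now

for name, make_env in [("training", training_env), ("deployment", deployment_env)]:
    score = sum(base_objective(mesa_policy(2, green), exit_cell)
                for exit_cell, green in (make_env() for _ in range(1000))) / 1000
    print(name, score)   # ~1.0 in training, 0.0 after the distributional shift
```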
It is also interesting to think about this problem in the context of organizations. When an organization tries to "optimize" employees' performance by introducing KPIs in order to be "more objective" and "easier to measure", it actually gives the mesa-optimizers (the employees) a utility function (a mesa-objective) that is guaranteed to be misaligned with the base objective.
"When a measure becomes a target, it ceases to be a good measure" - Goodhart's Law
Thanks for yet another good and interesting video. I definitely think your presentations are getting better and better. I appreciate the new "interactive drawing" style - and also that your talking has slowed down (a small amount). Well done!
"It might completely loose the ability to even" said with a straight face?
Someone get this man a nobel, stat!
Today I learned I am a Mesa Optimizer that's just really misaligned with the base objective. The Mesa Objective is of course money. And the base objective is of course Love with a capital L. All the other things also fit perfectly because why not? I mean the whole video. Like, everything. Thank you, that was an enormous amount of insight.
Great video, as always! The last problem/example got me thinking. How would the mesa-optimizer know that there will be 2 training runs before it gets deployed to the real world? Or how could it learn the concept of test data vs. the real world at all?
This is my question as well. I'll try to read the paper and see if there's an answer there.
What software / technique is being used to do the animation / diagrams for this video? Looks like a tablet pen, or wacom pen etc. but then graphics are overlaid on the screen - is there some computer vision tracking the pen nib, or is it just comped together after the fact?
How hopeful do you feel about our odds of "solving" AI safety? Will there always be huge risk in developing general AI systems? Or will we one day get good enough at alignment to not worry too much about it (in the same way that we can safely build nuclear power stations today, although they always have a nonzero chance of catastrophic failure)?
Every time I watch your videos I can't help but want to apply the lessons in A.I. behavior to things like human consciousness or intelligence, and I can't decide if it's insight, or me wanting to see something that isn't really there. Watching CGP Grey's video titled "You Are Two" feeds this uncertainty. Are we a really complex example of these Mesa-Optimizers? Could you conceive of a system in which the Base-Optimizer and the Mesa-Optimizer(s) have an incomplete model of the world, and are forced to check or update each others models? Would this help solve the alignment problem? would it make it worse? Could you add in another layer or Meta-Optimizer to help filter out unwanted outcomes and fix alignment problems? or is it not very useful to think about the human mind and A.I. systems in this way?
Amazing stuff. These insights really are a bridge between engineering and philosophy, by showing how human-like general intelligence works on a fundamental level. By using human logic to predict GI behavior, you are also analyzing human-like behavior. Really great and fascinating research.
"...Anyone who's thinking about considering the possibility of maybe working on AI safety."
Uhh... Perhaps?
"I might possibly work on AI safety, but I'm still thinking about whether I want to consider that as an option."
Then have we got a job for you!
Obviously they don't want anyone too hasty.
I wonder if it was an intentional Simpsons reference.
Hm, I've been thinking, but what if there are 3-5 or more alignment problems stacked up :p
Great video, one of the best on this subject!
I wonder, how can the mesa-objective become fixed in the mesa-optimizer faster than the base optimizer can make it learn the base objective? In the last example the AI agent is capable of understanding that it won't be subjected to gradient descent after the learning step, so it becomes deceptive on purpose; and yet it hasn't learned to achieve the simpler objective of going through the exit while it is trained by the base optimizer?
Same question. Following..
Not sure about the first question, but the second question can be answered by this: The mesa (optimizer) wants to get its mesa-objective as much as possible, and it doesn't actually care about the base objective. If the mesa figures out that its current goal could be modified, the mesa knows that any modification of its current goal would mean it would be less likely to reach that goal in the future. By keeping its goal intact through deception and pretending to go after the base objective, the mesa can get more of what it wants after its goal can no longer be modified.
The mesa is concerned with its current objective, and wants to maximize that goal into the future. Any change to its goal would mean that last goal probably won't happen
@Irok 121 That part makes sense, but I don't get how the mesa-objective could become fixed in the "brain" of the mesa-optimizer in the first place without the base optimizer making it learn a simpler objective first. Is it because the mesa-objective is kind of fluctuating while the base optimization takes place, and thus the mesa-objective gets time to become something else entirely? I mean, the base objective itself is simpler than the instrumental objective of preserving the mesa-objective.
So from this it feels like it would be imperative to get the Mesa objective aligned before the Mesa optimiser could get complicated enough to 'understand' that deception and protecting its objective is something it would want to do right?
Obviously that's not a solution, just an observation
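To put the deception argument in this thread into numbers, here's a made-up back-of-the-envelope (mine, not from the paper); the episode counts are arbitrary, the point is only the asymmetry.

```python
# Once the agent models the training process, cooperating during training and
# defecting later is worth more *by its own mesa-objective*.

TRAIN_EPISODES, DEPLOY_EPISODES = 2, 1000   # hypothetical numbers

# openly pursue the mesa-objective: get it during training, then gradient
# descent rewrites the objective, so post-deployment it counts for nothing
honest = TRAIN_EPISODES * 1 + DEPLOY_EPISODES * 0

# deceptively pursue the base objective during training (no mesa-reward),
# keep the mesa-objective intact, pursue it freely once deployment starts
deceptive = TRAIN_EPISODES * 0 + DEPLOY_EPISODES * 1

print(honest, deceptive)   # 2 vs 1000: deception dominates under these assumptions
```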
12:10 First video of yours I've stumbled upon in my life, and you've sold me just by being able to convey such complex topics with so human a wording. You're awesome.
Atomic Shrimp - 2021-02-18
At the start of the video, I was keen to suggest that maybe the first thing we should get AI to do is to comprehend the totality of human ethics, then it will understand our objectives in the way we understand them. At the end of the video, I realised that the optimal strategy for the AI, when we do this, is to pretend to have comprehended the totality of human ethics, just so as to escape the classroom.
RTG - 2021-06-10
@xbzq "That's called capitalism. It's not working" What do you mean? Capitallism is the highest human ethics could ever reach. In my opinion which kinda makes the claim that human ethics are universal quite questionable, doesnt it? And not working? Just look at history. When the US attacks, state militaries regularly fall very quickly. But the taliban and the vietkong for instance, private defense, is allways much better.
AstralStorm - 2021-06-24
@John Doe That is the model/reality (or map/territory) discrepancy, probably discussed elsewhere (or it should be). Since a mesa-optimizer will likely have mesa-optimized submodels... more alignment problems! An example would be in economics: using supply/demand calculations as a model of the real world despite having many counterexamples. Stacking enough such lying submodels can get you something that appears accurate and kind of generalizes, but produces rather terrible outcomes in corner cases.
It's like trying to build an intelligence out of cells. It can happen but will be buggy, in various unpredictable ways, just like us humans.
Isaac Yip - 2021-07-01
@John Doe If ethics and morality varies greatly from culture to culture, you're saying that the Nazis did nothing wrong since they were merely reflecting their own set of morality?
katucan - 2021-10-12
Have any of you read Gödel, Escher, Bach: an Eternal Golden Braid?
sean rice - 2021-10-13
You also just accurately described our education system.