Book review: The Roots of Romanticism

I’ve been getting interested in the Romantic movement recently. I’d started to dimly sense its enormous influence on later thought, but I had only a hazy idea of the details. So I picked up Isaiah Berlin’s The Roots of Romanticism to get a better understanding.

I chose this book in particular because I love Berlin’s style. The book was originally a series of lectures, given to an audience in Washington, DC in 1965 and broadcast on BBC radio. It’s not just a transcription: it’s been cleaned up to be more text-like, but it still has an enjoyably conversational feel. I’m going to start with a couple of long quotes from the first chapter, ‘In Search of a Definition’, both to give a sense of that style and to set up the central question:

> Suppose you were travelling about Western Europe, say in the 1820s, and suppose you spoke, in France, to the avant-garde young men who were friends of Victor Hugo, Hugolâtres. Suppose you went to Germany and spoke there to the people who had once been visited by Madame de Staël, who had interpreted the German soul to the French. Suppose you had met the Schlegel brothers, who were great theorists of romanticism, or one or two of the friends of Goethe in Weimar, such as the fabulist and poet Tieck, or other persons connected with the romantic movement, and their followers in the universities, students, young men, painters, sculptors, who were deeply influenced by the work of these poets, these dramatists, these critics. Suppose you had spoken in England to someone who had been influenced by, say, Coleridge, or above all by Byron – anyone influenced by Byron, whether in England or France or Italy, or beyond the Rhine, or beyond the Elbe.

These weird new scenes had a baffling mishmash of surface concerns — mysticism, poetry, folklore, free will — and the detailed content of any one scene often outright contradicted that of the others. But somehow at the base of it all was a correlated aesthetic sense:

> Suppose you had spoken to these persons. You would have found that their ideal of life was approximately of the following kind. The values to which they attached the highest importance were such values as integrity, sincerity, readiness to sacrifice one’s life to some inner light, dedication to some ideal for which it is worth sacrificing all that one is, for which it is worth both living and dying. You would have found that they were not primarily interested in knowledge, or in the advance of science, not interested in political power, not interested in happiness, not interested, above all, in adjustment to life, in finding your place in society, in living at peace with your government, even in loyalty to your king, or to your republic. You would have found that common sense, moderation, was very far from their thoughts. You would have found that they believed in the necessity of fighting for your beliefs to the last breath in your body, and you would have found that they believed in the value of martyrdom as such, no matter what the martyrdom was martyrdom for. You would have found that they believed that minorities were more holy than majorities, that failure was nobler than success, which had something shoddy and something vulgar about it.

That’s not your father’s Enlightenment values. Where did all this come from? Is it just a loose cluster of attitudes to life, or does it hold together in some deeper way?

The Romantic bag of ideas

Understanding this better is not a purely academic exercise for me. This doesn’t feel like a dead movement that I’m learning about out of mild historical curiosity. The whole wider culture seems to be stuck in a pendulum swing towards romantic-inspired ideas. I’m reminded of a Slate Star Codex review of The Black Swan, which talks about the previous swing of the pendulum. Taleb’s book was published in 2007, during a wave of enthusiasm for New Atheism, cognitive biases, I Fucking Love Science and the like:

> … it seems like the “moment” for books about rationality came and passed around 2010. Maybe it’s because the relevant science has slowed down – who is doing Kahneman-level work anymore? Maybe it’s because people spent about eight years seeing if knowing about cognitive biases made them more successful at anything, noticed it didn’t, and stopped caring. But reading The Black Swan really does feel like looking back to another era when the public briefly became enraptured by human rationality, and then, after learning a few cool principles, said “whatever” and moved on.

This is all passé now, irrationalism is in, and we’re all supposed to be trading meme stonks or something. (I started writing this at the peak of… whatever the GameStop thing was… and only just remembered to come back and finish it.) There’s a resurgence of fascination with mysticism, with conspiracy theories, with the ontology-blurring effects of psychedelics. This is all vaguely Romanticism-tinged, in the same way that the 2007 zeitgeist was Enlightenment-tinged. It looks suspiciously like we collectively had enough of the Enlightenment bag of ideas and automatically reached out for the other standard-issue bag of ideas that western philosophy has helpfully put within grabbing range.

I wanted to get a better idea of what’s in the bag. It’s not all awful, any more than the Enlightenment bag was awful. There are some deep and important ideas that aren’t in the Enlightenment bag, which is one of the things that makes it so compelling. But it’s not the sort of stuff I want to uncritically load my brain up with.

I don’t want to get too sidetracked into current issues, though. This post is just about taking a look at what’s in the bag. I’ll give a brief summary of some of the main preoccupations of the movement, at least as told by Berlin. I’ll finish with Berlin’s answer to the question of what ties these ideas together.

First, though, I’ve got a couple of reservations about this book which I want to flag before I start. The first is to do with the style. Berlin has this witty, urbane midcentury style which I love – I could read piles of this stuff. It’s not a romantic style at all… it’s not dry or technical either, there’s a bit of warmth to it, but it’s very controlled, there’s a bit of ironic distance, none of the GIANT OUTPOURING OF EMOTION I associate with romanticism. To be honest, I’m much more comfortable with this – I don’t quite get romanticism deep down – but it still makes me suspicious that someone who writes in this style is also going to not quite get it, and miss some of the point. Still, I’m actually willing to read this, and I probably would not read a whole load of romantic rhapsodising.

The second reservation is that I have no idea how accurate any of this is! These are popular talks, and Berlin hardly quotes any primary sources at all, and I certainly haven’t gone and looked any up. He’s an entertaining speaker, but it’s all a little bit too fluent, and I’m suspicious that the entertainment comes at the expense of getting the details correct.

With that massive disclaimer, let’s go on to look at the bag of ideas. Berlin covers the following:

  • Particularism: a fascination with specific details for their own sake, and a distrust of big abstract theories

  • Expressionism: works of art should express the nature of the artist, rather than communicate objective truths

  • The importance of the will and of imposing this will on the world through authentic expression, both on an individual level and at the scale of nations

  • The grounding of knowledge in action, rather than disinterested inquiry

  • Emphasis on symbolism and mythic understanding

  • An understanding that ordered, rational knowledge only accounts for a small part of experience, and that there are huge murky unexplained depths beneath. Nostalgia and paranoia as hidden creatures in these depths.

I’ll go through these in turn.

Particularism and expressionism

Berlin starts by talking about early influences on romanticism. One key character in this section is someone I’d never heard of, Johann Georg Hamann. From what I can quickly make out from the Berlin book and his Wikipedia article he was mainly notable as a kind of superspreader of the ideas of his time. He introduced Rousseau’s work to Kant, translated Hume into German, influenced Goethe and Hegel. His own work was mostly fragmentary and unfinished, but a recurring theme was a deep suspicion of generalisations, concepts and categories:

> What they left out, of necessity, because they were general, was that which was unique, that which was particular, that which was the specific property of this particular man, or this particular thing. And that alone was of interest, according to Hamann. If you wished to read a book, you were not interested in what this book had in common with many other books. If you looked at a picture, you did not wish to know what principles had gone into the making of this picture, principles which had also gone into the making of a thousand other pictures in a thousand other ages by a thousand different painters. You wished to react directly, to the specific message, to the specific reality, which looking at this picture, reading this book, speaking to this man, praying to this god would convey to you.

Hamann’s protégé Johann Herder shared this fascination with picturesque detail:

> Herder is the father, the ancestor, of all those travellers, all those amateurs, who go round the world ferreting out all kinds of forgotten forms of life, delighting in everything that is peculiar, everything that is odd, everything that is native, everything that is untouched.

This led him towards an expressionist view of the nature of art. Enlightenment thinkers had expected theories of aesthetic beauty to converge on shared, objective properties of the artwork:

> … what everyone agreed about was that the value of a work of art consisted in the properties which it had, its being what it was – beautiful, symmetrical, shapely, whatever it might be. A silver bowl was beautiful because it was a beautiful bowl, because it had the properties of being beautiful, however that is defined. This had nothing to do with who made it, and it had nothing to do with why it was made.

For Herder, art instead expressed the idiosyncratic attitude towards life of the individual artist. There was no need for these individual attitudes to converge, and indeed the attitudes of different artists can be mutually contradictory. The important thing is for each artist to express their own nature to the fullest extent that they can.

Nationalism and the will

Herder applied these ideas at the group level as well as the individual. Groups of people enmeshed in a similar way of life would naturally share certain attitudes, and these would be reflected in their art:

> If a folk song speaks to you, they said, it is because the people who made it were Germans like yourself, and they spoke to you, who belong with them in the same society; and because they were Germans they used particular nuances, they used particular successions of sounds, they used particular words which, being in some way connected, and swimming on the great tide of words and symbols and experience upon which all Germans swim, have something peculiar to say to certain persons which they cannot say to certain other persons. The Portuguese cannot understand the inwardness of a German song as a German can, and a German cannot understand the inwardness of a Portuguese song, and the very fact that there is such a thing as inwardness at all in these songs is an argument for supposing that these are not simply objects like objects in nature, which do not speak; they are artefacts, that is to say, something which a man has made for the purpose of communicating with another man.

This is a sort of nationalism, and it influenced later, much more damaging kinds. Knowing what came later, it’s easy to read this as an argument for hereditary racial differences, but Herder’s version is a culturally transmitted gestalt:

> Herder does not use the criterion of blood, and he does not use the criterion of race. He talks about the nation, but the German word Nation in the eighteenth century did not have the connotation of ‘nation’ in the nineteenth. He speaks of language as a bond, and he speaks of soil as a bond, and the thesis, roughly speaking, is this: That which people who belong to the same group have in common is more directly responsible for their being as they are than that which they have in common with others in other places. To wit, the way in which, let us say, a German rises and sits down, the way in which he dances, the way in which he legislates, his handwriting and his poetry and his music, the way in which he combs his hair and the way in which he philosophises all have some impalpable common gestalt.

Most importantly, Herder isn’t interested in demonstrating the superiority of any of these national groups. Berlin describes him rather endearingly as someone who delights in almost everything:

> Herder is one of those not very many thinkers in the world who really do absolutely adore things for being what they are, and do not condemn them for not being something else. For Herder everything is delightful. He is delighted by Babylon and he is delighted by Assyria, he is delighted by India and he is delighted by Egypt. He thinks well of the Greeks, he thinks well of the Middle Ages, he thinks well of the eighteenth century, he thinks well of almost everything except the immediate environment of his own time and place. If there is anything which Herder dislikes it is the elimination of one culture by another. He does not like Julius Caesar because Julius Caesar trampled on a lot of Asiatic cultures, and we shall now not know what the Cappadocians were really after. He does not like the Crusades, because the Crusades damaged the Byzantines, or the Arabs, and these cultures have every right to the richest and fullest self-expression, without the trampling feet of a lot of imperialist knights. He disliked every form of violence, coercion and the swallowing of one culture by another, because he wants everything to be what it is as much as it possibly can.

Unfortunately the next person to take up this idea of national identity was Johann Fichte. Fichte was a philosopher following in the tradition of Kant. Kant himself was very much not a romantic:

> He disliked everything that was rhapsodical or confused in any respect. He liked logic and he liked rigour. He regarded those who objected to these qualities as simply mentally indolent. He said that logic and rigour were difficult exercises of the human mind, and that it was customary for those who found these things too difficult to invent objections of a different type.

Still, Kant influenced romantic thinking through his ideas on human freedom.

> One of the propositions about which he was convinced was that every man as such is aware of the difference between, on the one hand, inclinations, desires, passions, which pull at him from outside, which are part of his emotional or sensitive or empirical nature; and on the other hand the notion of duty, of obligation to do what is right, which often came into conflict with desire for pleasure and with inclination.
>
> In the case of Kant it became an obsessive central principle. Man is man, for Kant, only because he chooses. The difference between man and the rest of nature, whether animal or inanimate or vegetable, is that other things are under the law of causality, other things follow rigorously some kind of foreordained schema of cause and effect, whereas man is free to choose what he wishes.

Fichte had some variant on Kant’s ideas about freedom and the will – I’m unaware of the details but it certainly seems to involve getting very excited about it:

> ‘At the mere mention of the name freedom’, says Fichte, ‘my heart opens and flowers, while at the word necessity it contracts painfully.’

He combined his conception of freedom with Herder’s strand of nationalism to get a much more virulent, aggressive kind, involving the struggle of nations to become free:

> Gradually, after Napoleon’s invasions and the general rise of nationalist sentiment in Germany, Fichte began thinking that perhaps what Herder said of human beings was true, that a man was made a man by other men, that a man was made a man by education, by language… So, gradually, he moved from the notion of the individual as an empirical human being in space to the notion of the individual as something larger, say a nation, say a class, say a sect. Once you move to that, then it becomes its business to act, it becomes its business to be free, and for a nation to be free means to be free of other nations, and if other nations obstruct it, it must make war…
>
> So Fichte ends as a rabid German patriot and nationalist. If we are a free nation, if we are a great creator engaged upon creating those great values which in fact history has imposed upon us, because we happen not to have been corrupted by the great decadence which has fallen upon the Latin nations; if we happen to be younger, healthier, more vigorous than those decadent peoples (and here Francophobia emerges again) who are nothing but the debris of what was once no doubt a fine Roman civilisation – if that is what we are, then we must be free at the expense of no matter what, and therefore, since the world cannot be half slave and half free, we must conquer the others, and absorb them into our texture.

The grounding of knowledge in action

Fichte’s emphasis on action in the world also shows up in his view of knowledge:

> Life does not begin with disinterested contemplation of nature or of objects. Life begins with action. Knowledge is an instrument, as afterwards William James and Bergson and many others were to repeat; knowledge is simply an instrument provided by nature for the purpose of effective life, of action; knowledge is knowing how to survive, knowing what to do, knowing how to be, knowing how to adapt things to our use, knowing, in other words, how to live (and what to do in order not to perish), in some unawakened, semi-instinctive fashion.
>
> … Because I live in a certain way, things appear to me in a certain fashion: the world of a composer is different from the world of a butcher; the world of a man in the seventeenth century is different from the world of a man in the twelfth century. There may be certain things which are common, but there are more things, or more important things at any rate, which, for him, are not.

I like this a lot, and it’s fascinating to see an earlier version of ideas that crop up later in the Pragmatists and then also in Heidegger and Wittgenstein. It certainly adds important ideas that Enlightenment views of detached inquiry were missing. But then the world-spirit stuff starts coming in. It starts well, with a sort of Merleau-Ponty-like thing about being constrained by the body…

> Fichte began by talking about individuals, then he asked himself what an individual was, how one could become a perfectly free individual. One obviously cannot become perfectly free so long as one is a three-dimensional object in space, because nature confines one in a thousand ways.

… but then quickly descends into whatever this is:

> Therefore the only perfectly free being is something larger than man, it is something internal – although I cannot force my body, I can force my spirit. Spirit for Fichte is not the spirit of an individual man, but something which is common to many men, and it is common to many men because each individual spirit is imperfect, because it is to some extent hemmed in and confined by the particular body which it inhabits. But if you ask what pure spirit is, pure spirit is some kind of transcendent entity (rather like God), a central fire of which we are all individual sparks – a mystical notion which goes back at least to Boehme.


Symbolism and mythic understanding

I wrote a short notebook post last year where I compared two types of symbolism: conventions like ‘red means stop’, which have been carefully pruned to have one and only one meaning, and ‘poetic’, ‘mythic’ symbolism like the medieval rose, with thick multilayered meanings.

I got this from McGilchrist’s The Master and His Emissary, but it turns out that he got it from The Roots of Romanticism and I didn’t notice at the time. Berlin lays out the same distinction. It’s this second, poetic type that’s important to the romantics:

> Symbolism is central in all romantic thought: that has always been noticed by all critics of the movement. Let me try to make it as clear as I am able, although I do not claim to understand it entirely, because, as Schelling very rightly says, romanticism is truly a wild wood, a labyrinth in which the only guiding thread is the will and the mood of a poet….
>
> There are two kinds of symbols, to put it at its very simplest. There are conventional symbols and symbols of a somewhat different kind. Conventional symbols offer no difficulty… Red and green traffic lights mean what they mean by convention.
>
> … But there are obviously symbols not quite of this kind… if you ask, for example, in what sense a national flag waving in the wind, which arouses emotions in people’s breasts, is a symbol, or in what sense the Marseillaise is a symbol… the answer will be that what these things symbolise is literally not expressible in any other way.

This second type of symbol feels inexhaustible; the more shades of meaning you extricate, the more you find. This is why they preoccupied the romantics, who were fascinated by the abundance and surplus of the world.

Nostalgia and paranoia

Berlin then talks about how this inexhaustibility leads to ‘two quite interesting and obsessive phenomena which are then very present both in nineteenth- and in twentieth-century thought and feeling.’ The first is nostalgia, the yearning for past meaning slipping from our fingers:

> The nostalgia is due to the fact that, since the infinite cannot be exhausted, and since we are seeking to embrace it, nothing that we do will ever satisfy us.
>
> … Your relation to the universe is inexpressible. This is the agony, this is the problem. This is the unending Sehnsucht, this is the yearning, this is the reason why we must go to distant countries, this is why we seek for exotic examples, this is why we travel in the East and write novels about the past, this is why we indulge in all manner of fantasies.

Then there is a darker version of this obsession, where the deep submerged currents of the world are out to get us.

> There is an optimistic version of romanticism in which what the romantics feel is that by going forward, by expanding our nature, by destroying the obstacles in our path, whatever they may be… we are liberating ourselves more and more and allowing our infinite nature to soar to greater and greater heights and become wider, deeper, freer, more vital, more like the divinity towards which it strives. But there is another, more pessimistic version of this, which obsesses the twentieth century to some extent. There is a notion that although we individuals seek to liberate ourselves, yet the universe is not to be tamed in this easy fashion. There is something behind, there is something in the dark depths of the unconscious, or of history; there is something, at any rate, not seized by us which frustrates our dearest wishes.

This paranoia shows up in attempts to understand the consequences of the French Revolution, where the world had avenged itself on all the Enlightenment bluechecks who had tried to tame it with reason:

> … what the Revolution led everybody to suspect was that perhaps not enough was known: the doctrines of the French philosophes, which were supposedly a blueprint for the alteration of society in any desired direction, had in fact proved inadequate. Therefore, although the upper portion of human social life was visible – to economists, psychologists, moralists, writers, students, every kind of scholar and observer of the facts – that portion was merely the tip of some huge iceberg of which a vast section was sunk beneath the ocean. This invisible section had been taken for granted a little too blandly, and had therefore avenged itself by producing all kinds of exceedingly unexpected consequences.

This paranoia can inspire great art, or take ‘all kinds of other, sometimes much cruder, forms’:

> It takes the form, for example, of looking for all kinds of conspiracies in history. People begin to think that perhaps history is formed by forces over which we have no control. Someone is at the back of it all: perhaps the Jesuits, perhaps the Jews, perhaps the Freemasons.

I said I wasn’t going to explicitly link any of this back to Current Year, but at this point the echoes are not subtle. I’ll move on quickly to the final section, where I talk about how Berlin ties these disparate ideas together.

Comfort with contradiction

I can’t resist quoting one more chunk from the introductory chapter, an inspired prose poem on the wild variety of Romantic life and thought:

> It is extreme nature mysticism, and extreme anti-naturalist aestheticism. It is energy, force, will, life, étalage du moi; it is also self-torture, self-annihilation, suicide. It is the primitive, the unsophisticated, the bosom of nature, green fields, cow-bells, murmuring brooks, the infinite blue sky. No less, however, it is also dandyism, the desire to dress up, red waistcoats, green wigs, blue hair, which the followers of people like Gérard de Nerval wore in Paris at a certain period. It is the lobster which Nerval led about on a string in the streets of Paris. It is wild exhibitionism, eccentricity, it is the battle of Ernani, it is ennui, it is taedium vitae, it is the death of Sardanopolis, whether painted by Delacroix, or written about by Berlioz or Byron. It is the convulsion of great empires, wars, slaughter and the crashing of worlds.

It’s a lot of other things besides. (There’s like a page more of this on either side… I have to stop somewhere.) What’s the connection between them?

Berlin makes the case that it’s precisely this comfort with contradiction that’s new in Romantic thought. The Romantics are free from the oppressive need to make any sort of consistent global sense out of their experience, so they can layer together as many weird ideas as they like.

This is a huge departure from Enlightenment thought, which expected coherent theories:

> There are three propositions, if we may boil it down to that, which are, as it were, the three legs upon which the whole Western tradition rested. They are not confined to the Enlightenment, although the Enlightenment offered a particular version of them, transformed them in a particular manner. The three principles are roughly these. First, that all genuine questions can be answered, that if a question cannot be answered it is not a question. We may not know what the answer is, but someone else will.
>
> … The second proposition is that all these answers are knowable, that they can be discovered by means which can be learnt and taught to other persons…
>
> … The third proposition is that all the answers must be compatible with one another, because, if they are not compatible, then chaos will result.

Viewed through this lens, the ideas of the previous sections come together as a way of navigating life without any absolute set of rules to act as a guide. Particularism is popular because details matter more than unreliable theories. Expressionism, because the important thing is to make something personally meaningful from the fragments available to you. Action is vital because there is no ultimate theory detached from individual understanding, so everyone must navigate as well as they can from their current starting point, enmeshed in the local culture. Fixed axioms are unavailable, but symbols can still work as potent ordering principles, natural clustering points in the web of meanings. And paranoia is a natural response to the other, inconsistent strands that can never be completely assimilated and that may come to harm you.

> There is a collision here of what Hegel afterwards called ‘good with good’. It is due not to error, but to some kind of conflict of an unavoidable kind, of loose elements wandering about the earth, of values which cannot be reconciled. What matters is that people should dedicate themselves to these values with all that is in them.

This makes a lot of sense to me, but there are still things that I’m confused by. This inconsistent patchwork somehow had to be built on top of a Christian worldview, with all the ultimate grounding in God’s truth that that implies. This was some time before the deeper collapse of systems of meaning in the late nineteenth and early twentieth century, so I would expect some sort of counterbalancing pull towards coherence, and I didn’t get a sense of how that worked from Berlin’s book. I maybe got a glimpse of it with Fichte’s talk about the world-spirit as a transcendent entity, ‘a central fire of which we are all individual sparks’. So maybe there was some nod to consistency at this inaccessible universal level, but an understanding that individual people or nations couldn’t achieve it?

Maybe this unravelling of systems of meaning started earlier than I imagined? I recently came across the following quote:

> Thus all round, the intellectual lightships had broken from their moorings, and it was then a new and trying experience. The present generation which has grown up in an open spiritual ocean, which has got used to it and has learned to swim for itself, will never know what it was to find the lights all drifting, the compasses all awry, and nothing left to steer by except the stars.

This comes from the historian and novelist James Anthony Froude, writing about his own crisis of faith. I was surprised to learn that this was in the 1840s, not say the 1890s. So at least some of the breakdown was happening quite early.

Of course, I’m relying on a secondary source, so another option is that Berlin was writing a long way into the process of fragmentation and so maybe he reads more of this into the Romantics than was actually there. Still, it does look like a lot of resources for navigating groundlessness were available in Western culture earlier than I realised. It makes sense that we’d be reaching for this bag of ideas in times as weird as these.

Note: This review started as a series of three newsletter entries in a kind of lazy quotes-and-notes format. I wanted to have a more polished single post that I could refer back to, and that turned out to be more work than I expected. I ended up changing the structure quite a lot, shifting from following the chronological order of events to focusing more on major ideas of the movement, which has come at the expense of covering the people involved in as much detail. So if you’re really interested, and can stand a few weird tangents about Philip Pullman’s influences and the sinking of the Titanic, the newsletter versions could be worth a look too.

Worse than quantum physics, part 2

This is Part 2 of a two-part explanation (Part 1 is here). It won’t make much sense on its own!

In this post I’m going to get into the details of the analogy I set up last time. So far I’ve described how the PR box is ‘worse than quantum physics’ in a specific sense: it violates the CHSH inequality more strongly than any quantum system, pushing past the Tsirelson bound of 2\sqrt{2} to reach the maximum possible value of 4. I also introduced Piponi’s box example, another even simpler ‘worse than quantum physics’ toy system.
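
To make the numbers above concrete, here is a minimal sketch that computes the CHSH value for a PR box. It assumes the standard PR box definition (outputs are uniformly random subject to a XOR b = x AND y) and one common sign convention for the CHSH expression; the names `pr_box`, `correlator` and `chsh` are mine, and the labelling may differ from the one used in Part 1.

```python
import math
from itertools import product

def pr_box(a, b, x, y):
    """P(a, b | x, y) for a PR box (assumed standard form): a XOR b == x AND y."""
    return 0.5 if (a ^ b) == (x & y) else 0.0

def correlator(x, y):
    """E(x, y): expected value of (-1)^(a XOR b) for measurement settings x, y."""
    return sum((-1) ** (a ^ b) * pr_box(a, b, x, y)
               for a, b in product((0, 1), repeat=2))

# One common form of the CHSH combination (sign convention is an assumption).
chsh = correlator(0, 0) + correlator(0, 1) + correlator(1, 0) - correlator(1, 1)

classical_bound = 2
tsirelson_bound = 2 * math.sqrt(2)
print(chsh, tsirelson_bound, classical_bound)
```

Under these assumptions the three correlators with x AND y = 0 are each +1 and the remaining one is -1, so the combination reaches 4, above both the classical bound of 2 and the Tsirelson bound.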

This time I’ll explain the connection between Piponi’s box and qubit phase space, and then show that a similar CHSH-inequality-like ‘logical Bell inequality’ holds there too. In this case the quantum system has a Tsirelson-like bound of \sqrt{3}, interestingly intermediate between the classical limit of 1 and the maximum possible value of 3 obtained by Piponi’s box. Finally I’ll dump a load of remaining questions into a Discussion section in the hope that someone can help me out here.

A logical Bell inequality for the Piponi box

Here’s the table from the last post again:

Measurement T F
a 1 0
b 1 0
a \oplus b 1 0

As with the PR box, we can use the yellow highlighted cells in the table (the T column) to get a version of Abramsky and Hardy’s logical Bell inequality \sum p_i \leq N-1, this time with N = 3 cells. These cells correspond to the three incompatible propositions a, b, a\oplus b, with combined probability \sum p_i = 3, violating the inequality by the maximum amount.

Converting to expected values E_i = 2p_i -1 gives

\sum E_i = 3 > N-2 = 1.

So that’s the Piponi box ↔ PR box part of the analogy sorted. Next I want to talk about the qubit phase space ↔ Bell state part. But first it will be useful to rewrite the table of Piponi box results in a way that makes the connection to qubit phase space more obvious:

The four boxes represent the four ‘probabilities’ P(a,b) introduced in the previous post, which can be negative. To recover the values in the table, add up rows, columns or diagonals of the diagram. For example, to find p(\lnot a), sum up the left hand column:

p(\lnot a) = P(\lnot a, b) + P(\lnot a, \lnot b) = \frac{1}{2} - \frac{1}{2} = 0.

Or to find p(a \oplus b), sum up the top-left-to-bottom-right diagonal:

p(a \oplus b) = P(a, \lnot b) + P(\lnot a, b) = \frac{1}{2} + \frac{1}{2} = 1.

I made the diagram below to show how this works in general, and now I’m not sure whether that was a good idea. It’s kind of busy and looking at the example above is probably a lot more helpful. On the other hand, I’ve gone through the effort of making it now and someone might find it useful, so here it is:

Qubit phase space

That’s the first part of the analogy done, between the PR box and Piponi’s box model. Now for the second part, between the CHSH system and qubit phase space. I want to show that the same set of measurements that I used for Piponi’s box also crops up in quantum mechanics as measurements on the phase space of a single qubit. This quantum case also violates the classical bound of \sum E_i = 1, but, as with the Tsirelson bound for an entangled qubit system, it doesn’t reach the maximum possible value. Instead, it tops out at \sum E_i = \sqrt{3}.

The measurements a, b, a\oplus b can be instantiated for a qubit in the following way. For a qubit |\psi\rangle, take

p(a)  = \langle \psi | Q_z | \psi \rangle ,

p(b) = \langle \psi | Q_x | \psi \rangle ,

with Q_i  = \frac{1}{2}(I-\sigma_i) for the Pauli matrices \sigma_i. The a\oplus b diagonal measurements then turn out to correspond to

p(a\oplus b) = \langle \psi | Q_y | \psi \rangle ,

completing the set of measurements.
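As a sanity check on these definitions, here is a minimal Python sketch. It is not the calculation from the original post: the helper names (`expectation`, `probability`, `measurement_table`) are made up for illustration, and it uses plain nested lists so that nothing beyond the standard library is needed.

```python
# Pauli matrices as nested lists of (complex) numbers.
SIGMA = {
    "x": [[0, 1], [1, 0]],
    "y": [[0, -1j], [1j, 0]],
    "z": [[1, 0], [0, -1]],
}

def expectation(matrix, psi):
    """Return <psi|M|psi> for a 2x2 matrix M and a normalised two-component state."""
    m_psi = [sum(matrix[i][j] * psi[j] for j in range(2)) for i in range(2)]
    return sum(psi[i].conjugate() * m_psi[i] for i in range(2)).real

def probability(axis, psi):
    """p = <psi|Q_axis|psi>, with Q = (I - sigma) / 2 as defined in the text."""
    return 0.5 * (1 - expectation(SIGMA[axis], psi))

def measurement_table(psi):
    """The three measurements a, b, a xor b, mapped to z, x, y respectively."""
    return {
        "a": probability("z", psi),
        "b": probability("x", psi),
        "a xor b": probability("y", psi),
    }

ket0 = [1 + 0j, 0j]  # the state |0>
print(measurement_table(ket0))  # a is 0, b and a xor b are both 1/2
```

Any other normalised state can be passed in as a pair of complex amplitudes in the same way.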

This is the qubit phase space I described in my second post on negative probability – for more details on how this works and how the corresponding P(a,b)s are calculated, see for example the papers by Wootters on finite-state Wigner functions and Picturing Qubits in Phase Space.

As a simple example, in the case of the qubit state |0\rangle these measurements give

p(a) = 0

p(b) = \frac{1}{2}

p(a\oplus b) = \frac{1}{2},

leading to the following phase space:

A Tsirelson-like bound for qubit phase space

Now, we want to find the qubit state |\psi\rangle which gives the largest value of \sum p_i. To do this, I wrote out |\psi\rangle in the general Bloch sphere form |\psi\rangle = \cos(\theta / 2) |0\rangle + e^{i\phi} \sin(\theta / 2) |1\rangle and then maximised the value of the highlighted cells in the table:

\sum p_i = p(a) + p(b) + p(a\oplus b) = \frac{3}{2} - \frac{1}{2}(\cos\theta + \sin\theta\cos\phi + \sin\theta\sin\phi )

This is a straightforward calculation but the details are kind of fiddly, so I’ve relegated them to a separate page (like the boring technical appendix at the back of a paper, but blog post style). Anyway the upshot is that this quantity is maximised when \phi = \frac{5\pi}{4} , \sin\theta = \frac{\sqrt{2}}{\sqrt{3}} and \cos\theta = -\frac{1}{\sqrt{3}}, leading to the following table:

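Measurement T F
a \frac{1}{2}\left(1 + \frac{1}{\sqrt{3}} \right) \frac{1}{2}\left(1 - \frac{1}{\sqrt{3}} \right)
b \frac{1}{2}\left(1 + \frac{1}{\sqrt{3}} \right) \frac{1}{2}\left(1 - \frac{1}{\sqrt{3}} \right)
a \oplus b \frac{1}{2}\left(1 + \frac{1}{\sqrt{3}} \right) \frac{1}{2}\left(1 - \frac{1}{\sqrt{3}} \right)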

The corresponding qubit phase space, if you’re interested, is the following:

Notice the negative ‘probability’ in the bottom left, with a value of around -0.183. This is in fact the most negative value possible for qubit phase space.
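The value of about -0.183 can be recovered from the measurement table alone, by inverting the row, column and diagonal sum rules from earlier in the post. Here is a minimal sketch: the function name `quasi_probabilities` is made up for illustration, and it assumes the four P(a,b) entries sum to 1.

```python
import math

def quasi_probabilities(p_a, p_b, p_xor):
    """Solve the sum rules for the four P(a, b) entries.

    Uses p(a) = P(T,T) + P(T,F), p(b) = P(T,T) + P(F,T),
    p(a xor b) = P(T,F) + P(F,T), plus the assumption that all four sum to 1.
    Keys are (value of a, value of b).
    """
    return {
        ("T", "T"): (p_a + p_b - p_xor) / 2,
        ("T", "F"): (p_a - p_b + p_xor) / 2,
        ("F", "T"): (-p_a + p_b + p_xor) / 2,
        ("F", "F"): 1 - (p_a + p_b + p_xor) / 2,
    }

# Optimal qubit state: all three T probabilities equal (1 + 1/sqrt(3)) / 2.
p = 0.5 * (1 + 1 / math.sqrt(3))
qubit = quasi_probabilities(p, p, p)
print(qubit[("F", "F")])  # the negative entry, equal to (1 - sqrt(3)) / 4

# Piponi's box: all three T probabilities equal 1.
piponi = quasi_probabilities(1, 1, 1)
print(piponi[("F", "F")])  # -0.5
```

The negative entry is P(\lnot a, \lnot b) in both cases, and the qubit value (1 - \sqrt{3})/4 sits between 0 and the Piponi box value of -\frac{1}{2}.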

This time, adding up the numbers in the yellow-highlighted cells of the table gives

\sum p_i = \frac{3}{2}\left(1 + \frac{1}{\sqrt{3}} \right),

or, in terms of expectation values,

\sum E_i = \sum (2p_i - 1) =   \sqrt{3}.

So \sqrt{3} is our Tsirelson-like bound for this system, in between the classical limit of 1 and the Piponi box value of 3.
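A quick numerical check of this bound, as a sketch rather than a proof: the code evaluates the Bloch-sphere expression for \sum p_i at the stated optimum and compares it with a coarse grid search over the sphere. The grid resolution and the names (`sum_p`, `best_grid`) are arbitrary choices for illustration.

```python
import math

def sum_p(theta, phi):
    """p(a) + p(b) + p(a xor b) for the Bloch-sphere state with angles (theta, phi)."""
    return 1.5 - 0.5 * (
        math.cos(theta)
        + math.sin(theta) * math.cos(phi)
        + math.sin(theta) * math.sin(phi)
    )

# Stated optimum: phi = 5*pi/4 and cos(theta) = -1/sqrt(3).
theta_opt = math.acos(-1 / math.sqrt(3))
phi_opt = 5 * math.pi / 4
best_stated = sum_p(theta_opt, phi_opt)

# Coarse grid search over the whole sphere as a sanity check.
steps = 200
best_grid = max(
    sum_p(math.pi * i / steps, 2 * math.pi * j / steps)
    for i in range(steps + 1)
    for j in range(steps)
)

sum_e_stated = 2 * best_stated - 3  # convert to expectation values, E = 2p - 1
print(best_stated, best_grid, sum_e_stated)
```

The grid maximum should come out just below the stated optimum, since the optimal theta does not fall exactly on a grid point.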

Further questions

As with all of my physics blog posts, I end up with more questions than I started with. Here are a few of them:

Is this analogy already described in some paper somewhere? If so, please point me at it!

Numerology. Why \sqrt{3} and not some other number? As a first step, I can do a bit of numerology and notice that \sqrt{3} = \sqrt{N/2}, where N=6 is the number of cells in the table, and that this rule also fits the CHSH bound of 2\sqrt{2}, where there are N=16 cells.

I can also try this formula on the Mermin example from my Bell post. In that case N=36, so the upper bound implied by the rule would be 3\sqrt{2} … which turns out to be correct. (I didn’t find the upper bound in the post, but you can get it by putting \tfrac{1}{8}(2+\sqrt 2) in all the highlighted cells of the table, similarly to CHSH.)

The Mermin example is close enough to CHSH that it’s not really an independent data point for my rule, but it’s reassuring that it still fits, at least.
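The arithmetic behind this numerology is easy to check. The sketch below just compares \sqrt{N/2} with the three bounds quoted above. For the Mermin case it also recomputes the bound from the highlighted cells, assuming there are 12 of them (six propositions with two cells each), which is my reading of the table rather than something stated here.

```python
import math

# (label, number of table cells N, bound quoted in the text)
cases = [
    ("qubit phase space", 6, math.sqrt(3)),
    ("CHSH", 16, 2 * math.sqrt(2)),
    ("Mermin", 36, 3 * math.sqrt(2)),
]

for label, n_cells, quoted_bound in cases:
    guess = math.sqrt(n_cells / 2)
    print(label, guess, quoted_bound, math.isclose(guess, quoted_bound))

# Mermin bound from the table: 12 highlighted cells (assumed), each (2 + sqrt(2)) / 8,
# grouped into 6 propositions, then converted with E = 2p - 1.
mermin_sum_p = 12 * (2 + math.sqrt(2)) / 8
mermin_sum_e = 2 * mermin_sum_p - 6
print(mermin_sum_e)
```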

What does this mean? Does it generalise? I don’t know. There’s a big literature on different families of Bell results and their upper bounds, and I don’t know my way around it.

Information causality. OK, playing around with numbers is fine, but what does it mean conceptually? Again, I don’t really know my way around the literature. I know there’s a bunch of papers, starting from this one by Pawlowski et al, which introduces a physical principle called ‘information causality’. According to that paper, this states that, for a sender Alice and a receiver Bob,

> the information gain that Bob can reach about the previously unknown to him data set of Alice, by using all his local resources and m classical bits communicated by Alice, is at most m bits.

This principle somehow leads to the Tsirelson bound… as you can see I have not looked into the details yet. This is probably what I should do next. It’s very much phrased in terms of having two separated systems, so I don’t know whether it can be applied usefully in my case of a single qubit.

If you have any insight into any of these questions, or you notice any errors in the post, please let me know in the comments below, or by email.

Worse than quantum physics

I’m still down the rabbithole of thinking way too much about quantum foundations and negative probabilities, and this time I came across an interesting analogy, which I will attempt to explain in this post and the next one. This should follow on nicely from my last post, where I talked about one of the most famous weird features of quantum physics, the violation of the Bell inequalities.

It’s not necessary to read all of that post to understand this one, but you will need to be somewhat familiar with the Bell inequalities (and the CHSH inequality in particular) from somewhere else. For the more technical parts, you’ll also need to know a little bit about Abramsky and Hardy’s logical Bell formulation, which I also covered in the last post. But the core idea probably makes some kind of sense without that background.

So, in that last post I talked about the CHSH inequality and how quantum physics violates the classical upper limit of 2. The example I went through in the post is designed to make the numbers easy, and reaches a value of 2.5, but it’s possible to pick a set of measurements that pushes it further again, to a maximum of 2\sqrt{2} (which is about 2.828). This value is known as the Tsirelson bound.

This maximum value is higher than anything allowed by classical physics, but doesn’t reach the absolute maximum that’s mathematically attainable. The CHSH inequality is normally written something like this:

| E(a,b) + E(\bar{a}, b) + E(a, \bar{b}) - E(\bar{a}, \bar{b}) | \leq 2.

Each of the Es has to be between -1 and +1, so if it was possible to always measure +1 for the first three and -1 for the last one you’d get 4.
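Both limits can be checked by brute force. In the sketch below the classical case is modelled in the usual way, with each dial setting carrying a fixed outcome of +1 or -1 and each E being the product of the two outcomes. That modelling choice and the function name `chsh` are assumptions for illustration. Mixtures of such assignments can't do better than the best single one, so enumerating the 16 deterministic cases is enough.

```python
from itertools import product

def chsh(e_ab, e_abar_b, e_a_bbar, e_abar_bbar):
    """The CHSH combination |E(a,b) + E(a_bar,b) + E(a,b_bar) - E(a_bar,b_bar)|."""
    return abs(e_ab + e_abar_b + e_a_bbar - e_abar_bbar)

# Classical case: predetermined outcomes, correlators are products of outcomes.
classical_max = max(
    chsh(a * b, abar * b, a * bbar, abar * bbar)
    for a, abar, b, bbar in product([1, -1], repeat=4)
)

# Algebraic case: the four correlators are treated as independent values of +1 or -1.
algebraic_max = max(chsh(*es) for es in product([1, -1], repeat=4))

print(classical_max, algebraic_max)  # 2 and 4
```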

This kind of hypothetical ‘superquantum correlation’ is interesting because of the potential to illuminate what’s special about the Tsirelson bound – why does quantum mechanics break the classical limit, but not go all the way? So systems that are ‘worse than quantum physics’ and push all the way to 4 are studied as toy models that can hopefully illuminate something about the constraints on quantum mechanics. The standard example is known as the Popescu-Rohrlich (PR) box, introduced in this paper.

This sounds familiar…

I was reading up on the PR box a while back, and it reminded me of something else I looked into. In my blog posts on negative probability, I used a simple example due to Dan Piponi. This example has the same general structure as measurements on a qubit, but it’s also ‘worse than quantum mechanics’, in the sense that one of the probabilities is more negative than anything allowed in quantum mechanics. Qubits are somewhere in the middle, in between classical systems and the Piponi box.

I immediately noticed the similarity, but at first I thought it was probably something superficial and didn’t investigate further. But after learning about Abramsky and Hardy’s logical formulation of the Bell inequalities, which I covered in the last post, I realised that there was an exact analogy.

This is really interesting to me, because I had no idea that there was any sort of Tsirelson bound equivalent for a single particle system. I’ve already spent quite a bit of time in the last couple of years thinking about the phase space of a single qubit, because it seems to me that a lot of essential quantum weirdness is hidden in there already, before you even consider entanglement with a second qubit – you’ve already got the negative probabilities, after all. But I wasn’t expecting this other analogy to turn up.

I haven’t come across this result in the published literature. But I also haven’t done anything like a thorough search, and it’s quite difficult to because Piponi’s example is in a blog post, rather than a paper. So maybe it’s new, or maybe it’s too simple to write down and stuck in the ghost library, or maybe it’s all over the place and I just haven’t found it yet. I really don’t know, and it seemed like the easiest thing was to just write it up and then try and find out once I had something concrete to point at. I am convinced it hasn’t been written up at anything like a blog-post-style introductory level, so hopefully this can be useful however it turns out.

Post structure

I decided to split this argument into two shorter parts and post them separately, to make it more readable. This first part is just background on the Tsirelson bound and the PR box – there’s nothing new here, but it was useful for me to collect the background I need in one place. I also give a quick description of Piponi’s box model.

In the second post, I’ll move on to explaining the single qubit analogy. This is the interesting bit!

The Tsirelson bound: Mermin’s machine again

To illustrate how Tsirelson’s bound is attained, I’ll go back to Mermin’s machine from the last post. I’ll use the same basic setup as before, but move the settings on the detectors:

This time the two settings on each detector are at right angles to each other, and the right hand detector settings are rotated 45 degrees from the left hand detector. As before, quantum mechanics says that the probabilities of different combinations of lights flashing will obey

p(T,T) = p(F,F) = \frac{1}{2}\cos^2\left(\frac{\theta}{2}\right),

p(T,F) = p(F,T) = \frac{1}{2}\sin^2\left(\frac{\theta}{2}\right),

where \theta is the angle between the detector settings. The numbers are more hassly than Mermin’s example, which was picked for simplicity – here’s the table of probabilities:

Dial setting (T,T) (T,F) (F,T) (F,F)
ab \tfrac{1}{8}(2+\sqrt 2) \tfrac{1}{8}(2-\sqrt 2) \tfrac{1}{8}(2-\sqrt 2) \tfrac{1}{8}(2+\sqrt 2)
ab' \tfrac{1}{8}(2-\sqrt 2) \tfrac{1}{8}(2+\sqrt 2) \tfrac{1}{8}(2+\sqrt 2) \tfrac{1}{8}(2-\sqrt 2)
a'b \tfrac{1}{8}(2+\sqrt 2) \tfrac{1}{8}(2-\sqrt 2) \tfrac{1}{8}(2-\sqrt 2) \tfrac{1}{8}(2+\sqrt 2)
a'b' \tfrac{1}{8}(2+\sqrt 2) \tfrac{1}{8}(2-\sqrt 2) \tfrac{1}{8}(2-\sqrt 2) \tfrac{1}{8}(2+\sqrt 2)

Then we follow the logical Bell procedure of the last post, take a set of mutually contradictory propositions (the highlighted cells) and find their combined probability. This gives \sum p_i = 2+\sqrt 2, or, converting to expectation values E_i = 2p_i - 1,

\sum E_i = 2\sqrt 2 .

This is the Tsirelson bound.
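The table and the final sum can be reproduced from the two probability formulas. This is a sketch under two assumptions inferred from the table rather than stated outright: the angle is 45 degrees for ab, a'b and a'b' and 135 degrees for ab', and the highlighted cells are the matching-light cells everywhere except ab', where they are the differing-light cells.

```python
import math

def row(theta_degrees):
    """Probabilities (TT, TF, FT, FF) for detector settings theta degrees apart."""
    half_angle = math.radians(theta_degrees) / 2
    same = 0.5 * math.cos(half_angle) ** 2
    diff = 0.5 * math.sin(half_angle) ** 2
    return (same, diff, diff, same)

# Assumed angles between the left and right settings (see the note above).
angles = {"ab": 45, "ab'": 135, "a'b": 45, "a'b'": 45}

sum_p = 0.0
for setting, theta in angles.items():
    tt, tf, ft, ff = row(theta)
    # Highlighted cells: differing lights for ab', matching lights otherwise.
    sum_p += (tf + ft) if setting == "ab'" else (tt + ff)

sum_e = 2 * sum_p - 4  # four propositions, E = 2p - 1
print(sum_p, sum_e)
```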

The PR box

The idea of the PR box is to get the highest violation of the inequality possible, by shoving all of the probability into the highlighted cells, like this:

Dial setting (T,T) (T,F) (F,T) (F,F)
ab 1/2 0 0 1/2
a\bar{b} 0 1/2 1/2 0
\bar{a}b 1/2 0 0 1/2
\bar{a}\bar{b} 1/2 0 0 1/2

This time, adding up all the highlighted boxes gives the maximum \sum E_i = 4 .


This is kind of an aside in the context of this post, but the original motivation for the PR box was to demonstrate that you could push past the quantum limit while still not allowing signalling between the two devices: if you only have access to the left hand box, for example, you can’t learn anything about the right hand box’s dial setting. Say you set the left hand box to dial setting a. If the right hand box was set to b you’d end up measuring T with a probability of

p(T,T| a,b) + p(T,F| a,b) = \frac{1}{2} + 0 = \frac{1}{2}.

If the right hand box was set to \bar{b} instead you’d still get \frac{1}{2}:

p(T,T| a,\bar{b}) + p(T,F| a,\bar{b}) = 0 + \frac{1}{2} = \frac{1}{2}.

The same conspiracy holds if you set the left hand box to \bar{a}, so whatever you do you can’t find out anything about the right hand box.
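The no-signalling property can be checked for every combination of settings at once. A minimal sketch, with the table typed in by hand and helper names (`left_marginal`, `right_marginal`) chosen for illustration:

```python
from fractions import Fraction

half = Fraction(1, 2)

# PR box table: (left setting, right setting) -> probabilities of (TT, TF, FT, FF).
pr_box = {
    ("a", "b"): (half, 0, 0, half),
    ("a", "b_bar"): (0, half, half, 0),
    ("a_bar", "b"): (half, 0, 0, half),
    ("a_bar", "b_bar"): (half, 0, 0, half),
}

def left_marginal(left, right):
    """Probability that the left light shows T, for the given pair of settings."""
    tt, tf, ft, ff = pr_box[(left, right)]
    return tt + tf

def right_marginal(left, right):
    """Probability that the right light shows T, for the given pair of settings."""
    tt, tf, ft, ff = pr_box[(left, right)]
    return tt + ft

for left in ("a", "a_bar"):
    # The two entries are the same whichever setting the right hand box uses.
    print(left, [left_marginal(left, right) for right in ("b", "b_bar")])
```

The same check with `right_marginal` shows that the right hand box can't learn the left hand setting either.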

Negative probabilities

Another interesting feature of the PR box, which will be directly relevant here, is the connection to negative probabilities. Say you want to explain the results of the PR box in terms of underlying probabilities P(a,a',b,b') for all of the settings at once. This can’t be done in terms of normal probabilities, which is not surprising: this property of having consistent results independent of the measurement settings you choose is exactly what’s broken down for non-classical systems like the CHSH system and the PR box.

However you can reproduce the results if you allow some negative probabilities. In the case of the PR box, you end up with the following:

P(T,T,T,T) = \frac{1}{2}

P(T,T,T,F) = 0

P(T,T,F,T) = -\frac{1}{2}

P(T,T,F,F) = 0

P(T,F,T,T) = 0

P(T,F,T,F) = 0

P(T,F,F,T) = \frac{1}{2}

P(T,F,F,F) = 0

P(F,T,T,T) = -\frac{1}{2}

P(F,T,T,F) = \frac{1}{2}

P(F,T,F,T) = \frac{1}{2}

P(F,T,F,F) = 0

P(F,F,T,T) = 0

P(F,F,T,F) = 0

P(F,F,F,T) = 0

P(F,F,F,F) = 0

(I got these from Abramsky and Brandenburger’s An Operational Interpretation of Negative Probabilities and No-Signalling Models.) To get back the probabilities in the table above, sum up all relevant Ps for each dial setting. As an example, take the top left cell of the table above. To get the probability of (T,T) for dial setting (a,b), sum up all cases where a and b are both T:

P(T,T,T,T) + P(T,T,T,F) + P(T,F,T,T) + P(T,F,T,F) = \frac{1}{2}

In this way we recover the values of all the measurements in the table – it’s only the Ps that are negative, not anything we can actually measure. This feature, along with the way that the number -\tfrac{1}{2} crops up specifically, is what reminded me of Piponi’s blog post.
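Summing by hand over sixteen entries is error-prone, so here is a small sketch that computes the marginals for any pair of settings. The ordering P(a, a', b, b') follows the worked example above. The dictionary layout and the function name `marginal` are choices made for illustration.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Nonzero entries of P(a, a', b, b') as listed above; every other entry is 0.
P = {
    "TTTT": half, "TTFT": -half, "TFFT": half,
    "FTTT": -half, "FTTF": half, "FTFT": half,
}

POSITION = {"a": 0, "a'": 1, "b": 2, "b'": 3}

def marginal(left, right):
    """Sum P over the two unmeasured settings, for each pair of outcomes."""
    table = {}
    for outcome in product("TF", repeat=2):
        table["".join(outcome)] = sum(
            value
            for state, value in P.items()
            if state[POSITION[left]] == outcome[0]
            and state[POSITION[right]] == outcome[1]
        )
    return table

for left, right in product(["a", "a'"], ["b", "b'"]):
    print(left, right, marginal(left, right))
```

All of the resulting marginals are ordinary probabilities of 0 or 1/2. One thing to watch when comparing with the table: with the values as listed, the pair that comes out anticorrelated is a'b' rather than the a\bar{b} row, so this is the same box up to a relabelling of the dial settings.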

Piponi’s box model

The device in Piponi’s example is a single box containing two bits a and b, and you can make one of three measurements: the value of a, the value of b, or the value of a \oplus b. The result is either T or F, with probabilities that obey the following table:

Measurement T F
a 1 0
b 1 0
a \oplus b 1 0

These measurements are inconsistent and can’t be described with any normal probabilities P(a,b), but, as with the PR box, they can with negative probabilities:

P(T,T) = \frac{1}{2}

P(T,F) = \frac{1}{2}

P(F,T) = \frac{1}{2}

P(F,F) = -\frac{1}{2}

For example, the probability of measuring a\oplus b and getting F is

P(T,T) + P(F,F) = \frac{1}{2} - \frac{1}{2} = 0,

as in the table above.
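The whole table can be recovered the same way. A minimal sketch, where the function name `measure` and the tuple keys are just illustrative choices:

```python
from fractions import Fraction

half = Fraction(1, 2)

# Quasi-probabilities P(a, b) from the text, keyed by (value of a, value of b).
P = {("T", "T"): half, ("T", "F"): half, ("F", "T"): half, ("F", "F"): -half}

def measure(which):
    """Return (p(T), p(F)) for measuring 'a', 'b' or 'a xor b'."""
    def result(a, b):
        if which == "a":
            return a
        if which == "b":
            return b
        return "T" if a != b else "F"  # exclusive or of the two bits

    p_true = sum(v for (a, b), v in P.items() if result(a, b) == "T")
    p_false = sum(v for (a, b), v in P.items() if result(a, b) == "F")
    return p_true, p_false

for which in ("a", "b", "a xor b"):
    print(which, measure(which))  # each row of the table: T with probability 1, F with 0
```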

Notice that -\frac{1}{2} crops up again! The similarities to the PR box go deeper, though. The PR box is a kind of extreme version of the CHSH state of two entangled qubits – same basic mathematics but pushing the correlations up higher. Analogously, Piponi’s box is an extreme version of the phase space for a single qubit. In both cases, quantum mechanics is perched intriguingly in the middle between classical mechanics and these extreme systems. I’ll go through the details of the analogy in the next post.

Bell’s theorem and Mermin’s machine

Anybody who’s not bothered by Bell’s theorem has to have rocks in his head.

— ‘A distinguished Princeton physicist’, as told to David Mermin

This post is a long, idiosyncratic discussion of the Bell inequalities in quantum physics. There are plenty of good introductions already, so this is a bit of a weird thing to spend my time writing. But I wanted something very specific, and couldn’t find an existing version that had all the right pieces. So of course I had to spend far too much time making one.

My favourite introduction is Mermin’s wonderful Quantum Mysteries for Anyone. This is an absolute classic of clear explanation, and lots of modern pop science discussions derive from it. It’s been optimised for giving a really intense gut punch of NOTHING IN THE WORLD MAKES SENSE ANY MORE, which I’d argue is the main thing you want to get out of learning about the Bell inequalities.

However, at some point if you get serious you’ll want to actually calculate things, which means you’ll need to make the jump from Mermin’s version to the kind of exposition you see in a textbook. The most common modern version of the Bell inequalities you’ll see is the CHSH inequality, which looks like this:

| E(a,b) + E(\bar{a}, b) + E(a, \bar{b}) - E(\bar{a}, \bar{b}) | \leq 2

(It doesn’t matter what all of that means, at the moment… I’ll get to that later.) The standard sort of derivations of this tend to involve a lot of fussing with algebraic rearrangements and integrals full of \lambdas and so forth. The final result is less of a gut punch and more of a diffuse feeling of unease: "well I guess this number has to be between -2 and 2, but it isn’t".

This feels like a problem to me. There’s a 1929 New Yorker cartoon which depicts ordinary people in the street walking around dumbstruck by Einstein’s theory of general relativity. This is a comic idea because the theory was famously abstruse (particularly back then when good secondary explanations were thin on the ground). But the Bell inequalities are accessible to anyone with a very basic knowledge of maths, and weirder than anything in relativity. I genuinely think that everyone should be walking down the street clutching their heads in shock at the Bell inequalities, and a good introduction should help deliver you to this state. (If you don’t have rocks in your head, of course. In that case nothing will help you.)

It’s also a bit of an opaque black box. For example, why is there a minus sign in front of one of the Es but not the others? I was in a discussion group a few years back with a bunch of postdocs and PhD students, all of us with a pretty strong interest in quantum foundations, and CHSH came up at some point. None of us had much of a gut sense for what that minus sign was doing… it was just something that turned up during some algebra.

I wanted to trace a path from Mermin’s explanation to the textbook one, in the hope of propagating some of that intuitive force forward. I wrote an early draft of the first part of this post for a newsletter in 2018 but couldn’t see how to make the rest of it work, so I dropped it. This time I had a lot more success using some ideas I learned in the meantime. I ended up taking a detour through a third type of explanation, the ‘logical Bell inequalities’ approach of Abramsky and Hardy. This is a general method that can be used on a number of other similar ‘no-go theorems’, not just Bell’s original. It gives a lot more insight into what’s actually going on (including that pesky minus sign). It’s also surprisingly straightforward: the main result is a few steps of propositional logic.

That bit of propositional logic is the most mathematically involved part of this post. The early part just requires some arithmetic and the willingness to follow what Mermin calls ‘a simple counting argument on the level of a newspaper braintwister’. No understanding of the mathematics of quantum theory is needed at all! That’s because I’m only talking about why the results of quantum theory are weird, and not how the calculations that produce those results are done.

If you also want to learn to do the calculations, starting from a basic knowledge of linear algebra and complex numbers, I really like Michael Nielsen and Andy Matuschak’s Quantum Country, which covers the basic principles of quantum mechanics and also the Bell inequalities. You’d need to do the ‘Quantum computing for the very curious’ part, which introduces a lot of background ideas, and then the ‘Quantum mechanics distilled’ part, which has the principles and the Bell stuff.

There’s also nothing about how the weirdness should be interpreted, because that is an enormous 90-year-old can of rotten worms and I would like to finish this post some time in my life 🙂

Mermin’s machine

So, on to Mermin’s explanation. I can’t really improve on it, and it would be a good idea to go and read that now instead, and come back to my version afterwards. I’ve repeated it here anyway though, partly for completeness and partly because I’ve changed some notation and other details to mesh better with the Abramsky and Hardy version I’ll come to later.

(Boring paragraph on exactly what I changed, skip if you don’t care: I’ve switched Mermin’s ‘red’ and ‘green’ to ‘true’ and ‘false’, and the dial settings from 1,2,3 on both sides to a, a', a'' on the left side and b, b', b'' on the right side. I’ve also made one slightly more substantive change. Mermin explains at the end of his paper that in his setup, ‘One detector flashes red or green according to whether the measured spin is along or opposite to the field; the other uses the opposite color convention’. I didn’t want to introduce the complication of having the two detectors with opposite wiring, and have made them both respond the same way, flashing T for along the field and F for opposite. But I also wanted to keep Mermin’s results. To do that I had to change the dial positions of the right hand dial, so that a is opposite b, a' is opposite b', and a'' is opposite b''. )

Anyway, Mermin introduces the following setup:

The machine in the middle is the source. It fires out some kind of particle – photons, electrons, frozen peas, whatever. We don’t really care how it works, we’ll just be looking at why the results are weird.

The two machines on the right and left side are detectors. Each detector has a dial with three settings. On the left they’re labelled a, a' and a''. On the right, they’re b, b' and b''.

On the top of each are two lights marked T and F for true and false. (Again, we don’t really care what’s true or false, we’re keeping everything at a kind of abstract, operational level and not going into the practical details. It’s just two possible results of a measurement.)

It’s vital to this experiment that the two detectors cannot communicate at all. If they can, there’s nothing weird about the results. So assume that a lot of work has gone into making absolutely sure that the detectors are definitely not sharing information in any way at all.

Now the experiment just consists of firing out pairs of particles, one to each detector, with the dials set to different values, and recording whether the T or F light flashes on each detector. So you get a big list of results of the form

ab'TF, a''bFT, a'b'FF, ...

The second important point, other than the detectors not being able to communicate, is that you have a free choice of setting the dials. You can set them both beforehand, or when the particles are both ‘in flight’, or even set the right hand dial after the left hand detector has already received its particle but before the right hand particle gets there. It doesn’t matter.

Now you do like a million billion runs of this experiment, enough to convince you that the results are not some weird statistical fluctuation, and analyse the results. You end up with the following table:

Dial setting (T,T) (T,F) (F,T) (F,F)
ab 1/2 0 0 1/2
ab' 1/8 3/8 3/8 1/8
ab'' 1/8 3/8 3/8 1/8
a'b 1/8 3/8 3/8 1/8
a'b' 1/2 0 0 1/2
a'b'' 1/8 3/8 3/8 1/8
a''b 1/8 3/8 3/8 1/8
a''b' 1/8 3/8 3/8 1/8
a''b'' 1/2 0 0 1/2

Each dial setting has a row, and the entries in that row give the probabilities for getting the different results. So for instance if you set the dials to a' and b, there’s a 1/8 chance of getting (T,T).

This doesn’t obviously look particularly weird at first sight. It only turns out to be weird when you start analysing the results. Mermin condenses two results from this table which are enough to show the weirdness. The first is:

Result 1: This result relates to the cases where the two dials are set to ab, a'b', or a''b''. In these cases both lights always flash the same colour. So you might get ab TT, ab FF, a'b' TT etc, but never ab TF or a''b'' FT.

This is pretty easy to explain. The detectors can’t communicate, so if they do the same thing it must be something to do with the properties of the particles they are receiving. We can explain it straightforwardly by postulating that each particle has an internal state with three properties, one for each dial position. Each of these takes two possible values which we label T or F. We can write these states as e.g.

\begin{pmatrix} T & T & F \\ T & T & F \end{pmatrix}

where the entries on the top line refer to the left hand particle’s state when the dial is in the a, a' and a'' positions respectively, and the bottom line refers to the right hand particle’s state when the dial is in the b, b', b'' position.

Result 1 implies that the states of the two particles must always be the same. So the state above is an allowed one, but any state whose top and bottom lines differ is not.

Mermin says:

> This hypothesis is the obvious way to account for what happens in [Result 1]. I cannot prove that it is the only way, but I challenge the reader, given the lack of connections between the devices, to suggest any other.

Because the second particle will always have the same state as the first one, I’ll save some typing and just write the first one out as a shorthand. So the first example state will just become TTF.

Now on to the second result. This one covers the remaining options for dial settings, ab', a''b and the like.

Result 2: For the remaining dial settings, the lights flash the same colour 1/4 of the time, and different colours 3/4 of the time.

This looks quite innocuous on first sight. It’s only when you start to consider how it meshes with Result 1 that things get weird.

(This is the part of the explanation that requires some thinking ‘on the level of a newspaper braintwister’. It’s fairly painless and will be over soon.)

Our explanation for result 1 is that particles in each run of the experiment have an underlying state, and both particles have the same state. Let’s go through the implications of this, starting with the example state TTF.

I’ve enumerated the various options for the dials in the table below. For example, if the left dial is a and the right dial is b', we know that the left detector will light up T and the right will light up T, so the two lights are the same.

Dial setting Lights
ab' same
ab'' different
a'b same
a'b'' different
a''b different
a''b' different

Overall there’s a 1/3 chance of being the same and a 2/3 chance of being different. You can convince yourself that this is also true for all the states with two Ts and an F or vice versa: TTF, TFF, TFT, FTT, FTF, FFT.

That leaves TTT and FFF as the other two options. In those cases the lights will flash the same colour no matter what the dial is set to.

So whatever the underlying state is, the chance of the two lights being the same is at least ⅓. But this is incompatible with Result 2, which says that the probability is ¼.

(The thinky part is now done.)
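If you’d rather not trust the braintwister, the counting argument is small enough to check exhaustively. The sketch below enumerates all eight shared states and, for each one, the fraction of the six mismatched dial settings on which the two lights agree. Indexing the dial positions as 0, 1, 2 is just a convenience, and `same_fraction` is a made-up name.

```python
from fractions import Fraction
from itertools import product

DIALS = range(3)  # 0, 1, 2 stand for a, a', a'' on the left and b, b', b'' on the right

# Result 2 concerns the six settings where the two dials point at different positions.
mismatched = [(left, right) for left in DIALS for right in DIALS if left != right]

def same_fraction(state):
    """Fraction of mismatched settings on which both lights agree.

    `state` is the shared three-letter state, e.g. "TTF".
    """
    same = sum(state[left] == state[right] for left, right in mismatched)
    return Fraction(same, len(mismatched))

same_by_state = {"".join(s): same_fraction(s) for s in product("TF", repeat=3)}

for state, fraction in same_by_state.items():
    print(state, fraction)

print(min(same_by_state.values()))  # 1/3, which is larger than the observed 1/4
```

Any probabilistic mixture of these states gives a weighted average of the fractions above, so it can’t drop below 1/3 either.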

So Results 1 and 2 together are completely bizarre. No assignment of states will work. But this is exactly what happens in quantum mechanics!

You probably can’t do it with frozen peas, though. The details don’t matter for this post, but here’s a very brief description if you want it: the particles should be two spin-half particles prepared in a specific ‘singlet’ state, the dials should connect to magnets that can be oriented in three states at 120 degree angles from each other, and the lights on the detectors measure spin along and opposite to the field. The magnets should be set up so that the state for setting a on the left hand side is oriented at 180 degrees from the state for setting b on the right hand side; similarly a' should be opposite b' and a'' opposite b''. I’ve drawn the dials on the machine to match this. Quantum mechanics then says that the probabilities of the different results are

p(T,T) = p(F,F) = \frac{1}{2}\cos^2{\frac{\theta}{2}}

p(T,F) = p(F,T) = \frac{1}{2}\sin^2{\frac{\theta}{2}}

where \theta is the angle between the magnet states on the left and right sides. This reproduces the numbers in the table above.
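As a quick check that these formulas give the numbers in the table, here is a minimal sketch. It assumes \theta is 0 for the matched settings (ab, a'b', a''b'') and 120 degrees for every other pair, which is the reading that reproduces the table. The function name `row` is made up.

```python
import math

def row(theta_degrees):
    """Probabilities (TT, TF, FT, FF) for an angle theta between the two settings."""
    half_angle = math.radians(theta_degrees) / 2
    same = 0.5 * math.cos(half_angle) ** 2
    diff = 0.5 * math.sin(half_angle) ** 2
    return (same, diff, diff, same)

matched = row(0)       # rows ab, a'b', a''b'': (1/2, 0, 0, 1/2)
mismatched = row(120)  # all other rows: (1/8, 3/8, 3/8, 1/8)
print(matched)
print(mismatched)
```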

Once more with less thinking

Mermin’s argument is clear and compelling. The only problem with it is that you have to do some thinking. There are clever details that apply to this particular case, and if you want to do another case you’ll have to do more thinking. Not good. This is where Abramsky and Hardy’s logical Bell approach comes in. It requires more upfront setup (so actually more thinking in the short term – this section title is kind of a lie, sorry) but can then be applied systematically to all kinds of problems.

This first involves reframing the entries in the probability table in terms of propositional logic. For example, we can write the result (T,F) for (a',b) as a' \land \lnot b. Then the entries of the table correspond to the probabilities we assign to each statement: in this case, \text{prob}(a' \land \lnot b) = \frac{3}{8}.

Now, look at the following highlighted cells in three rows of the grid (the (T,T) and (F,F) cells of rows ab, a'b' and a''b''):

Dial setting (T,T) (T,F) (F,T) (F,F)
ab 1/2 0 0 1/2
ab' 1/8 3/8 3/8 1/8
ab'' 1/8 3/8 3/8 1/8
a'b 1/8 3/8 3/8 1/8
a'b' 1/2 0 0 1/2
a'b'' 1/8 3/8 3/8 1/8
a''b 1/8 3/8 3/8 1/8
a''b' 1/8 3/8 3/8 1/8
a''b'' 1/2 0 0 1/2

These correspond to the three propositions

\phi_1 = (a\land b) \lor (\lnot a \land\lnot b)

\phi_2 = (a'\land b') \lor (\lnot a' \land\lnot b')

\phi_3 = (a''\land b'') \lor (\lnot a'' \land\lnot b'') ,

which can be written more simply as

\phi_1 = a \leftrightarrow b

\phi_2 = a' \leftrightarrow b'

\phi_3 = a'' \leftrightarrow b''.

where the \leftrightarrow stands for logical equivalence. This also means that a can be substituted for b, and so on, which will be useful in a minute.

Next, look at the highlighted cells in these three rows (the (T,F) and (F,T) entries of the ab', ab'' and a'b'' rows):

Dial setting (T,T) (T,F) (F,T) (F,F)
ab 1/2 0 0 1/2
ab' 1/8 3/8 3/8 1/8
ab'' 1/8 3/8 3/8 1/8
a'b 1/8 3/8 3/8 1/8
a'b' 1/2 0 0 1/2
a'b'' 1/8 3/8 3/8 1/8
a''b 1/8 3/8 3/8 1/8
a''b' 1/8 3/8 3/8 1/8
a''b'' 1/2 0 0 1/2

These correspond to

\phi_4 = (a\land \lnot b') \lor (\lnot a \land b')

\phi_5 = (a\land \lnot b'') \lor (\lnot a \land b'')

\phi_6 = (a'\land \lnot b'') \lor (\lnot a' \land b'') ,

which can be simplified to

\phi_4 = a \oplus b'

\phi_5 = a \oplus b''

\phi_6 = a' \oplus b''.

where the \oplus stands for exclusive or.

Now it can be shown quite quickly that these six propositions are mutually contradictory. First use the first three propositions to get rid of b , b' and b'', leaving

a \oplus a'

a \oplus a''

a' \oplus a''

You can check that these are contradictory by drawing out the truth table, or maybe just by looking at them, or maybe by considering the following stupid dialogue for a while (this post is long and I have to entertain myself somehow):

Grumpy cook 1: You must have either beans or chips but not both.

Me: OK, I’ll have chips.

Grumpy cook 2: Yeah, and also you must have either beans or peas but not both.

Me: Fine, looks like I’m having chips and peas.

Grumpy cook 3: Yeah, and also you must have either chips or peas but not both.


Me: OK let’s back up a bit. I’d better have beans instead of chips.

Grumpy cook 1: You must have either beans or chips but not both.

Me: I know. No chips. Just beans.

Grumpy cook 2: Yeah, and also you must have either beans or peas but not both.

Me: Well I’ve already got to have beans. But I can’t have them with chips or peas. Got anything else?

Grumpy cook 3: NO! And remember, you must have either chips or peas.

Me: hurls tray

So, yep, the six highlighted propositions are inconsistent. But this wouldn’t necessarily matter, as some of the propositions are only probabilistically true. So you could imagine that, if you carefully set some of them to false in the right ways in each run, you could avoid the contradiction. However, we saw with Mermin’s argument above that this doesn’t save the situation – the propositions have ‘too much probability in total’, in some sense, to allow you to do this. Abramsky and Hardy’s logical Bell inequalities will quantify this vague ‘too much probability in total’ idea.
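For anyone who prefers the truth table to the grumpy cooks, the check is easy to hand to a computer. This is a minimal brute-force sketch: the function and variable names are my own, with a1 and a2 standing for a' and a''.

```python
from itertools import product

def iff(x, y):
    """Logical equivalence."""
    return x == y

def xor(x, y):
    """Exclusive or."""
    return x != y

def all_six_hold(a, a1, a2, b, b1, b2):
    """True if phi_1 to phi_6 all hold for this assignment of truth values."""
    return (iff(a, b) and iff(a1, b1) and iff(a2, b2)
            and xor(a, b1) and xor(a, b2) and xor(a1, b2))

# Try all 2^6 = 64 assignments and keep the ones satisfying everything.
satisfying = [values for values in product([True, False], repeat=6)
              if all_six_hold(*values)]
print(len(satisfying))  # 0
```

No assignment survives, so the six propositions can never all be true at once.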

Logical Bell inequalities

This bit involves a few lines of logical reasoning. We’ve got a set of propositions \phi_i (six of them in this example case, N in general), each with probability p_i. Let P be the probability of all of them happening together. Call this combined statement

\Phi = \bigwedge_i \phi_i.

Then

1 - P = \text{prob}\left( \lnot\Phi\right) = \text{prob}\left(\bigvee_i \lnot\phi_i\right)

where the second equality is de Morgan’s law. By the union bound, this is at most the sum of the probabilities of all the \lnot\phi_i s:

1 - P \leq \sum_i \text{prob}(\lnot\phi_i)

= \sum_i (1 - p_i)

= N - \sum_i p_i .

where N is the total number of propositions. Rearranging gives

\sum_i p_i \leq N + P - 1.

Now suppose the \phi_i are jointly contradictory, as in the Mermin example above, so that the combined probability P = 0. This gives the logical Bell inequality

\sum_i p_i \leq N-1 .

This is the precise version of the ‘too much probability’ idea. In the Mermin case, there are six propositions, three with probability 1 and three with probability ¾, which sum to 5.25. This is greater than N-1 = 5, so the inequality is violated.
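The arithmetic is simple enough to do by hand, but here it is as a short sketch using exact fractions. Each probability is read off the table as the sum of the two cells the proposition covers; the variable names are mine.

```python
from fractions import Fraction

# (T,T) + (F,F) in the rows ab, a'b', a''b''
p_equivalence = Fraction(1, 2) + Fraction(1, 2)
# (T,F) + (F,T) in the rows ab', ab'', a'b''
p_exclusive_or = Fraction(3, 8) + Fraction(3, 8)

probabilities = [p_equivalence] * 3 + [p_exclusive_or] * 3
total = sum(probabilities)
bound = len(probabilities) - 1  # N - 1 for jointly contradictory propositions

print(total, bound, total <= bound)  # 21/4 5 False
```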

This inequality can be applied to lots of different setups, not just Mermin’s. Abramsky and Hardy use the CHSH inequality mentioned in the introduction to this post as their first example. This is probably the most common example used to introduce Bell’s theorem, though the notation is usually somewhat different. I’ll go through Abramsky and Hardy’s version and then connect it back to the standard textbook notation.

The CHSH inequality

The CHSH experiment only uses two settings on each side, not three. I’ve drawn a ‘CHSH machine’ in the style of Mermin’s machine to illustrate it:

There are two settings a and \bar{a} on the left side, 60 degrees apart. And there are two settings b and \bar{b} on the right side, also 60 degrees apart, with b opposite a. This leads to the following table:

Dial setting (T,T) (T,F) (F,T) (F,F)
ab 1/2 0 0 1/2
a\bar{b} 3/8 1/8 1/8 3/8
\bar{a}b 3/8 1/8 1/8 3/8
\bar{a}\bar{b} 1/8 3/8 3/8 1/8

Now it’s just a case of following the same reasoning as for the Mermin case. The highlighted rows correspond to the propositions

\phi_1 = (a \land b) \lor (\lnot a \land \lnot b) = a \leftrightarrow b

\phi_2 = (a \land \bar{b}) \lor (\lnot a \land \lnot \bar{b}) = a \leftrightarrow \bar{b}

\phi_3 = (\bar{a} \land b) \lor (\lnot \bar{a} \land \lnot b) = \bar{a} \leftrightarrow b

\phi_4 = (\lnot \bar{a} \land \bar{b}) \lor (\bar{a} \land \lnot \bar{b}) = \bar{a} \oplus \bar{b}

As with Mermin’s example, these four propositions can be seen to be contradictory. Rather than trying to make up more stupid dialogues, I’ll just follow the method in the paper. First use \phi_3 to replace \bar{a} with b in \phi_4:

\phi_4 = b \oplus \bar{b} .

Then use \phi_1 to swap out b again, this time with a:

\phi_4 = a \oplus \bar{b} .

Finally use \phi_2 to swap out a with \bar{b}, leaving

\bar{b} \oplus \bar{b}

which is clearly contradictory.

(Sidenote: I guess these sort of arguments to show a contradiction do involve some thinking, which is what I was trying to avoid earlier. But in each case you could just draw out a truth table, which is a stupid method that a computer could do. So I think it’s reasonable to say that this is less thinking than Mermin’s method.)
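Here is roughly what that stupid method looks like for the CHSH case. It is only a sketch, with a_bar and b_bar as my names for \bar{a} and \bar{b}; it counts how many of the four propositions can hold at once under any assignment of truth values.

```python
from itertools import product

def chsh_propositions(a, a_bar, b, b_bar):
    """Truth values of phi_1 to phi_4 for one assignment."""
    return [a == b, a == b_bar, a_bar == b, a_bar != b_bar]

# Largest number of propositions satisfied by any of the 16 assignments.
most_satisfied = max(sum(chsh_propositions(*values))
                     for values in product([True, False], repeat=4))
print(most_satisfied)  # 3
```

At most three of the four can be true together, so they are jointly contradictory.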

Again, this violates the logical Bell inequality. In total, we have

\sum_i p_i = 1 + \frac{3}{4}  + \frac{3}{4}  + \frac{3}{4} = 3.25 > 3.

The textbook version of this inequality is a bit different. For a start, it uses an ‘expectation value’ for each proposition rather than a straightforward probability, where truth is associated with +1 and falsity with -1. So each proposition \phi_i has an expectation value E_i with

E_i = (+1)\cdot p_i + (-1)\cdot (1-p_i) = 2p_i -1.

Then summing over the E_is gives

\sum_i E_i = \sum_i (2p_i-1) = 2\sum_i p_i - N

and then, using the previous form of the logical Bell inequality,

\sum_i E_i \leq 2(N-1) - N = N-2.

A similar argument for -E_i shows that \sum_i E_i \geq -(N-2), so that this is a bound above and below:

|\sum_i E_i| \leq N - 2.

In this case N = 4 and so the inequality becomes |\sum_i E_i| \leq 2. However adding up the E_is associated to the propositions \phi_i gives 2.5, so the inequality is violated.
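As a quick check of that 2.5, here is the conversion from probabilities to expectation values written out as a sketch with exact fractions (names are mine):

```python
from fractions import Fraction

# Probabilities of phi_1 to phi_4 in the CHSH example.
probabilities = [Fraction(1), Fraction(3, 4), Fraction(3, 4), Fraction(3, 4)]

# E_i = (+1) * p_i + (-1) * (1 - p_i) = 2 * p_i - 1
expectations = [2 * p - 1 for p in probabilities]
total = sum(expectations)
bound = len(probabilities) - 2  # N - 2

print(total, bound, abs(total) <= bound)  # 5/2 2 False
```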

There’s still a little further to go to get the textbook version, but we’re getting close. The textbook version writes the CHSH inequality as

| E(a,b) + E(\bar{a}, b) + E(a, \bar{b}) - E(\bar{a}, \bar{b}) | \leq 2.

where the expectation value is written in the form

E(a,b) = \int A(a,\lambda) B(b, \lambda)\rho(\lambda) d\lambda.

The \lambda are ‘hidden variables’ – properties of the particles that dispose them to act in various ways. For example, in the Mermin case, we imagined them to have hidden states, like



that controlled their response to each dial, and showed that any choice of these hidden states would lead to a contradiction.

For a given \lambda, A(a, \lambda) and B(b, \lambda) are the values measured by the left and right hand machines respectively. In our case these values are always either +1 (if the machine flashes T) or -1 (if the machine flashes F). The CHSH argument can also be adapted to a more realistic case where some experimental runs have no detection at all, and the outcome can also be 0, but this simple version won’t do that.

For the dial settings a and b, all we care about with these hidden variables is whether they make the machines respond true or false. So in our case \lambda is just a set of four variables, \lambda = { a\land b, a\land \lnot b, \lnot a\land b, \lnot a\land\lnot b }, and the integral can just become a sum:

E(a,b) = (+1 \times +1)\cdot p(a\land b) + (+1 \times -1)\cdot p(a\land \lnot b) + (-1 \times +1)\cdot p(\lnot a\land b) + (-1 \times -1)\cdot p(\lnot a\land \lnot b)

= p(a\land b) + p(\lnot a\land \lnot b) - p(a\land \lnot b) - p(\lnot a\land b)

= p((a\land b) \lor (\lnot a\land \lnot b)) - p((a\land \lnot b) \lor(\lnot a\land b)).

Now that first proposition (a\land b) \lor (\lnot a\land \lnot b) is just \phi_1 from earlier, which had probability p_1. And the second one covers all the remaining possibilities, so it has probability 1-p_1. So

E(a,b) = p_1 - (1-p_1) = 2p_1 - 1 = E_1.

The argument goes through exactly the same way for E(a, \bar{b}) and E(\bar{a}, b). The last case, E(\bar{a}, \bar{b}), is slightly different. We get

E(\bar{a}, \bar{b}) = p((\bar{a}\land \bar{b}) \lor (\lnot \bar{a}\land \lnot \bar{b})) - p((\bar{a}\land \lnot \bar{b}) \lor(\lnot \bar{a}\land \bar{b}))

following the same logic as before. But this time \phi_4 matches the second proposition (\bar{a}\land \lnot \bar{b}) \lor(\lnot \bar{a}\land \bar{b}), not the first, so that

E(\bar{a}, \bar{b}) = (1-p_4) - p_4 = 1 - 2p_4 = -E_4.

This is where the minus sign in the CHSH inequality comes in! We have

|\sum_i E_i| = | E(a, b) + E(a, \bar{b}) + E(\bar{a}, b) - E(\bar{a}, \bar{b}) | \leq 2.
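To see the minus sign doing its job, here is a sketch that computes each E directly from the rows of the CHSH table, treating T as +1 and F as -1, and then forms the textbook combination. The labels "abar" and "bbar" are just my stand-ins for \bar{a} and \bar{b}.

```python
from fractions import Fraction as F

# Rows of the CHSH table, columns in the order (T,T), (T,F), (F,T), (F,F).
table = {
    ("a", "b"): [F(1, 2), F(0), F(0), F(1, 2)],
    ("a", "bbar"): [F(3, 8), F(1, 8), F(1, 8), F(3, 8)],
    ("abar", "b"): [F(3, 8), F(1, 8), F(1, 8), F(3, 8)],
    ("abar", "bbar"): [F(1, 8), F(3, 8), F(3, 8), F(1, 8)],
}

def correlation(row):
    """Expectation of the product of the two outcomes, with T = +1, F = -1."""
    p_tt, p_tf, p_ft, p_ff = row
    return p_tt - p_tf - p_ft + p_ff

E = {setting: correlation(row) for setting, row in table.items()}
chsh = (E[("a", "b")] + E[("a", "bbar")] + E[("abar", "b")]
        - E[("abar", "bbar")])
print(chsh)  # 5/2
```

The first three terms are 1, 1/2 and 1/2, and the last is -1/2, so subtracting it gives the same 2.5 as summing the E_i.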

So we end up with the standard inequality, but with a bit more insight into where the pieces come from. Also, importantly, it’s easy to extend to other situations. For example, you could follow the same method with the six Mermin propositions from earlier to make a kind of ‘Mermin-CHSH inequality’:

|\sum_i E_i| = | E(a, b) + E(a', b') + E(a'', b'') - E(a, b') - E(a, b'') - E(a', b'') | \leq 4.

Or you could have three particles, or a different set of measurements, or you could investigate what happens with other tables of correlations that don’t appear in quantum physics… this is a very versatile setup. The original paper has many more examples.
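For the ‘Mermin-CHSH’ combination above, the same kind of sketch shows that quantum mechanics breaks the bound of 4. Again T is +1 and F is -1, and the names are mine.

```python
from fractions import Fraction as F

# Columns in the order (T,T), (T,F), (F,T), (F,F).
matching_row = [F(1, 2), F(0), F(0), F(1, 2)]        # rows ab, a'b', a''b''
mismatched_row = [F(1, 8), F(3, 8), F(3, 8), F(1, 8)]  # rows ab', ab'', a'b''

def correlation(row):
    """Expectation of the product of the two outcomes."""
    p_tt, p_tf, p_ft, p_ff = row
    return p_tt - p_tf - p_ft + p_ff

# Three terms enter with a plus sign and three with a minus sign.
value = 3 * correlation(matching_row) - 3 * correlation(mismatched_row)
print(value, abs(value) <= 4)  # 9/2 False
```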

Final thoughts

There are still some loose ends that it would be good to tie up. I’d like to understand exactly how the inequality-shuffling in a ‘textbook-style’ proof of the CHSH inequality connects to Abramsky and Hardy’s version. Presumably some of it is replicating the same argument, but in a more opaque form. But also some of it must need to deal with the fact that it’s a more general setting, and includes things like measurements returning 0 as well as +1 or -1. It would be nice to figure out which bits are which. I think Bell’s original paper didn’t have the zero thing either, so that could be one place to look.

On the other hand… that all sounds a bit like work, and I can’t be bothered for now. I’d rather apply some of this to something interesting. My next post is probably going to make some connections between the logical Bell inequalities and my previous two posts on negative probability.

If you know the answers to my questions above and can save me some work, please let me know in the comments! Also, I’d really like to know if I’ve got something wrong. There are a lot of equations in this post and I’m sure to have cocked up at least one of them. More worryingly, I might have messed up some more conceptual points. If I’ve done that I’m even more keen to know!

Negative probability: now with added equations!

OK, so this is where I go back through everything from the last post, but this time show how all the fiddling around with boxes relates back to quantum physics, and also go into some technical details like explaining what I meant by ‘half the information’ in the discussion at the end. This is unavoidably going to need more maths than the last post, and enough quantum physics knowledge to be OK with qubits and density matrices. I’ll start by translating everything into a standard physics problem.

Qubit phase space

So, first off, instead of the ‘strange machine’ of the last post we will have a qubit state – as a first example I’ll take the |0\rangle state. The three questions then become measurements on it. Specifically, these measurements are expectation values q_i of the operators Q_i = \frac{1}{2}(I-\sigma_i), where the \sigma_i are the three Pauli matrices.

For |0\rangle we get the following:

q_z = \langle 0 | Q_z | 0 \rangle = 0

q_x = \langle 0 | Q_x | 0 \rangle = \frac{1}{2}

q_y = \langle 0 | Q_y | 0 \rangle = \frac{1}{2}
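These three values can be checked with a few lines of plain Python. This is just a sketch: the helper names are mine, and the state is written as a two-component list of complex numbers.

```python
# Pauli matrices as nested lists.
SIGMA = {
    "x": [[0, 1], [1, 0]],
    "y": [[0, -1j], [1j, 0]],
    "z": [[1, 0], [0, -1]],
}

def expectation(state, matrix):
    """Return <state| matrix |state> for a two-component state vector."""
    applied = [sum(matrix[r][c] * state[c] for c in range(2)) for r in range(2)]
    return sum(state[r].conjugate() * applied[r] for r in range(2))

def q_values(state):
    """Return q_i = <Q_i> with Q_i = (I - sigma_i) / 2, for a normalised state."""
    return {axis: (1 - expectation(state, sigma).real) / 2
            for axis, sigma in SIGMA.items()}

ket_zero = [1 + 0j, 0j]
q = q_values(ket_zero)
print(q)
```

For |0\rangle this gives q_z = 0 and q_x = q_y = 1/2, as above.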

This can be represented on the same sort of 2×2 grid I used in the previous post:

The |0\rangle state has a definite value of 0 for the Q_z measurement, so the probabilities in the cells where Q_z = 0 must sum to 1. For the Q_x measurement there is an equal chance of either Q_x = 0 or Q_x = 1. The third measurement, Q_y, can be shown to be associated with the diagonals of the grid, in the same way as in Piponi’s example in the previous post, and again there is an equal chance of either value. Imposing all these conditions gives the probability assignment above.

The 2×2 grid is called the phase space of the qubit, and the function that assigns probabilities to each cell is called the Wigner function W. To save on drawing diagrams, I’ll represent this as a square-bracketed matrix from now on:

W = \begin{bmatrix} W(0,1) && W(1,1) \\ W(0,0) && W(1,0) \end{bmatrix}

For much more detail on how this all works, the best option is probably to read Wootters, who developed a lot of the ideas in the first place. There’s his original paper, which has all the technical details, and a nice follow-up paper on Picturing Qubits in Phase Space which gives a bit more intuition for what’s going on.

In the previous post I gave the following formula for the Wigner function:

W = \frac{1}{4}\Bigg( \begin{bmatrix}1 && 1 \\ 1 && 1 \end{bmatrix} + q_z\begin{bmatrix}-1 && 1 \\ -1 && 1 \end{bmatrix} + (1-q_z)\begin{bmatrix}1 && -1 \\ 1 && -1 \end{bmatrix}

\quad +q_x\begin{bmatrix}1 && 1 \\ -1 && -1 \end{bmatrix} + (1-q_x)\begin{bmatrix}-1 && -1 \\ 1 && 1 \end{bmatrix} + q_y\begin{bmatrix}1 && -1 \\ -1 && 1 \end{bmatrix} + (1-q_y)\begin{bmatrix}-1 && 1 \\ 1 && -1 \end{bmatrix} \Bigg),

which simplifies to

W = \frac{1}{2}\begin{bmatrix}-q_z + q_x + q_y && q_z + q_x - q_y \\ 2 - q_z - q_x - q_y && q_z -q_x + q_y\end{bmatrix}

This is a somewhat different form to the standard formula for the Wigner function, but I’ve checked that they’re equivalent. I’ve put the details on a separate notes page here, in a sort of blog post version of the really boring technical appendix you get at the back of papers.
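The simplification step is easy to get wrong by a sign, so here is a sketch that builds W both ways and compares them on a few exact values. It only checks that the long formula and the short one agree with each other, not the equivalence with the standard Wigner function; all the function names are mine.

```python
from fractions import Fraction as F
from itertools import product

def scale(matrix, factor):
    return [[factor * entry for entry in row] for row in matrix]

def add(*matrices):
    return [[sum(m[r][c] for m in matrices) for c in range(2)] for r in range(2)]

def wigner_expanded(qz, qx, qy):
    """The long form: ignorance plus a pair of correction terms per question."""
    ones = [[1, 1], [1, 1]]
    z_term = [[-1, 1], [-1, 1]]
    x_term = [[1, 1], [-1, -1]]
    y_term = [[1, -1], [-1, 1]]
    terms = [ones]
    for q, pattern in [(qz, z_term), (qx, x_term), (qy, y_term)]:
        terms.append(scale(pattern, q))
        # The (1 - q) matrix is the q matrix with every sign flipped.
        terms.append(scale(pattern, -(1 - q)))
    return scale(add(*terms), F(1, 4))

def wigner_simplified(qz, qx, qy):
    """The short form."""
    return [[(-qz + qx + qy) / 2, (qz + qx - qy) / 2],
            [(2 - qz - qx - qy) / 2, (qz - qx + qy) / 2]]

samples = [F(0), F(1, 4), F(1, 2), F(1)]
agree = all(wigner_expanded(*q) == wigner_simplified(*q)
            for q in product(samples, repeat=3))
print(agree)  # True
```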

Magic states

As with the example in the last blog post, it’s possible to get qubit states where some of the values of the Wigner function are negative. The numbers don’t work out so nicely this time, but as one example we can take the qubit state |\psi\rangle = \frac{1}{\sqrt{1 + (1+\sqrt{2})^2}}\begin{pmatrix} 1 + \sqrt{2} \\ 1 \end{pmatrix}. (This is the +1 eigenvector of the operator \frac{1}{\sqrt{2}}\left(\sigma_z + \sigma_x\right).)

The Wigner function for |\psi\rangle is

W_\psi = \begin{bmatrix} \frac{1}{4} && \frac{1-\sqrt{2}}{4} \\ \frac{1 + \sqrt{2}}{4} && \frac{1}{4} \end{bmatrix} \approx \begin{bmatrix} 0.25 && -0.104 \\ 0.604 && 0.25 \end{bmatrix},

with one negative entry. I learned while writing this that the states with negative values are called magic states by quantum computing people! These are the states that provide the ‘magic’ for quantum computing, in terms of giving a speed-up over classical computing. I’d like to be able to say more about this link, but I’ll never finish the post if I have to get my head around all of that too, so instead I’ll link to this post by Earl Campbell that goes into more detail and points to some references. A quick note on the geometry, though:

The six eigenvectors of the Pauli matrices form the corners of an octahedron on the Bloch sphere, as in my dubious sketch above. We’ve already seen that the |0\rangle state has no magic – all the values are nonnegative. This also holds for the other five, which have the following Wigner functions:

W_{|1\rangle} = \begin{bmatrix} 0 && \frac{1}{2} \\ 0 && \frac{1}{2} \end{bmatrix}, W_{|+\rangle} = \begin{bmatrix} 0 && 0 \\ \frac{1}{2} && \frac{1}{2} \end{bmatrix}, W_{|-\rangle} = \begin{bmatrix} \frac{1}{2} && \frac{1}{2} \\ 0 && 0 \end{bmatrix},

W_{|y_+\rangle} = \begin{bmatrix} 0 && \frac{1}{2} \\ \frac{1}{2} && 0 \end{bmatrix}, W_{|y_-\rangle} = \begin{bmatrix} \frac{1}{2} && 0 \\ 0 && \frac{1}{2} \end{bmatrix}.

The other states on the surface of the octahedron or inside it also have no magic. The magic states are the ones outside the octahedron, and the further they are from the octahedron the more magic they are. So the most magic states are on the surface of the sphere opposite the middle of the triangular faces.
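Here is a sketch that recomputes the Wigner functions listed in this section from the simplified formula for W. It assumes each state is specified by its Bloch vector (x, y, z), so that q_i = (1 - r_i)/2; the Bloch vector I use for |\psi\rangle, (1/\sqrt{2}, 0, 1/\sqrt{2}), is worked out from the state given above.

```python
import math

def wigner_from_bloch(x, y, z):
    """Wigner function for the state with Bloch vector (x, y, z)."""
    qx, qy, qz = (1 - x) / 2, (1 - y) / 2, (1 - z) / 2
    return [[(-qz + qx + qy) / 2, (qz + qx - qy) / 2],
            [(2 - qz - qx - qy) / 2, (qz - qx + qy) / 2]]

states = {
    "|0>": (0, 0, 1), "|1>": (0, 0, -1),
    "|+>": (1, 0, 0), "|->": (-1, 0, 0),
    "|y+>": (0, 1, 0), "|y->": (0, -1, 0),
    "|psi>": (1 / math.sqrt(2), 0, 1 / math.sqrt(2)),
}

for name, bloch in states.items():
    w = wigner_from_bloch(*bloch)
    most_negative = min(min(row) for row in w)
    verdict = "has a negative entry" if most_negative < 0 else "no negative entries"
    print(name, w, verdict)
```

Only |\psi\rangle picks up a negative entry; the six Pauli eigenstates all come out with two cells at 1/2 and two at 0.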

Half the information

Why can’t we have a probability of -\frac{1}{2} as before? Well, I briefly mentioned the reason in the previous blog post, but I can go into more detail now. There are constraints on the values of W that forbid values that are this negative. First off, the values of W have to sum to 1 – this makes sense, as they are supposed to be something like probabilities.

The second constraint is more interesting. Taking the |0\rangle state as an example again, this state has a definite answer to one of the questions and no information at all about the other two. There’s redundancy in the questions, so exact answers to two of them would be enough to pin down the state precisely. So we have half of the possible information.

This turns out to be the most information you can get from any qubit state, in some sense. I say ‘in some sense’ because it’s a pretty odd definition of information.

I learned about this from a fascinating paper by van Enk, A toy model for quantum mechanics, which was actually my starting point for thinking about this whole topic. He starts with the Spekkens toy model, a very influential idea that reproduces a number of the features of quantum mechanics using a very simple model. Again, this is too big a topic to get into all the details, but the most basic system in this model maps to the six ‘non-magic’ qubit states listed above, in the corners of the octahedron. These all share the half-the-knowledge property of the |0\rangle state, where we know the answer to one question exactly and have no idea about the others.

Now van Enk’s aim is to extend this idea of ‘half the knowledge’ to more general probability distributions over the four boxes. But this requires having some kind of measure M of what half the knowledge means. He stipulates that this measure should have M = \frac{1}{2} for the six half-the-knowledge states we already have, which seems reasonable. Also, it should have M = 1 for states where we know all the information (impossible in quantum physics), and M = \frac{1}{4} for the state of total ignorance about all questions. Or to put it a bit differently,

M = 2^{-H},

where H is an entropy measure – it decreases from 2 to 1 to 0 as we learn more information about the system. There’s a parametrised family H_\alpha of entropies known as the Rényi entropies, which reproduce this behaviour for the cases above, and differ for other distributions over the boxes. (I have some rough notes about these here, which may or may not be helpful.) By far the most well-known one is the Shannon entropy H_1, used widely in information theory, but it turns out that this one doesn’t reproduce the states found in quantum physics. Instead, van Enk picks H_2, the collision entropy. This has quite a simple form:

H_2 = -\log_2 \left(\sum_i W_i^2 \right),

where the W_i are the four components of W – we’re just summing the squares of them. So then our information measure is just M_2 = \sum_i W_i^2, and the second constraint on W is that this can have value at most \frac{1}{2}:

\sum_i W_i^2 \leq \frac{1}{2}.
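To get a feel for the measure, here is a sketch that evaluates M_2 and H_2 for a few distributions over the four boxes: the reference cases van Enk stipulates, plus Piponi’s machine from the previous post. The function names are mine, and the cell values are listed in no particular order since only the squares matter.

```python
import math
from fractions import Fraction as F

def information_measure(cells):
    """M_2: the sum of the squares of the four cell values."""
    return sum(value ** 2 for value in cells)

def collision_entropy(cells):
    """H_2 = -log2(M_2)."""
    return -math.log2(information_measure(cells))

examples = {
    "total ignorance": [F(1, 4)] * 4,
    "|0> state": [F(1, 2), F(0), F(1, 2), F(0)],
    "full knowledge": [F(1), F(0), F(0), F(0)],
    "Piponi's machine": [F(-1, 2), F(1, 2), F(1, 2), F(1, 2)],
}

for name, cells in examples.items():
    m = information_measure(cells)
    print(name, "M_2 =", m, "H_2 =", collision_entropy(cells),
          "allowed" if m <= F(1, 2) else "forbidden")
```

Piponi’s distribution has M_2 = 1, the same as full knowledge, so it breaks the constraint.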

Why this particular entropy measure? That’s something I don’t really understand. Van Enk describes it as ‘the measure of information advocated by Brukner and Zeilinger’, and links to their paper, but so far I haven’t managed to follow the argument there, either. If anyone reads this and has any insight, I’d like to know!


In some ways, I know a lot more about negative probabilities than I did when I started getting interested in this. But conceptually I’m almost as confused as I was at the start! I think the main improvement is that I have some more focussed questions to be confused about:

  • Is the way of decomposing the Wigner function that I described in these posts any use for making sense of the negative probabilities? I found it quite helpful for Piponi’s example, in giving some more insight into how the negative value connects to that particular answer being ‘especially inconsistent’. Is it also useful for thinking about qubits?
  • Any link to the idea of negative probabilities representing events ‘unhappening’? As I said at the beginning of the first post, I love this idea but have never seen it fully developed anywhere in a satisfying way.
  • What’s going on with this collision entropy measure anyway?

I’m not a quantum foundations researcher – I’m just an interested outsider trying to understand how all these ideas fit together. So I’m likely to be missing a lot of context that people in the field would have. If you read this and have pointers to things that I’m missing, please let me know in the comments!

Negative probability

I’ve been thinking about the idea of negative probabilities a lot recently, and whether it’s possible to make any sense of them. (For some very muddled and meandering background on how I got interested in this, you could wade through my ramblings here, here, here and here, but thankfully none of that is required to understand this post.)

To save impatient readers the hassle of reading this whole thing: I’m not going to come up with any brilliant way of interpreting negative probabilities in this blog post! But recently I did notice a few things that are interesting and that I haven’t seen collected together anywhere else, so I thought it would be worth writing them up.

Now, why would you even bother trying to make sense of negative probabilities? I’m not going to go into this in any depth – John Baez has a great introductory post on negative probability that motivates the idea, and links to a good chunk of the (not very large) literature. This is well worth reading if you want to know more. But there are a couple of main routes that lead people to get interested in this thing.

The first route is pretty much pure curiosity: what happens if we try extending the normal idea of probabilities to negative numbers? This is often introduced in analogy with the way we often use negative numbers in applications to simplify calculations. For example, there’s a fascinating discussion of negative probability by Feynman which starts with the following simple situation:

A man starting a day with five apples who gives away ten and is given eight during the day has three left. I can calculate this in two steps: 5 – 10 = -5 and -5 + 8 = 3.

The final answer is satisfactorily positive and correct although in the intermediate steps of calculation negative numbers appear. In the real situation there must be special limitations of the time in which the various apples are received and given since he never really has a negative number, yet the use of negative numbers as an abstract calculation permits us freedom to do our mathematical calculations in any order, simplifying the analysis enormously, and permitting us to disregard inessential details.

So, although we never actually have a negative number of apples, allowing them to appear in intermediate calculations makes the maths simpler.

The second route is that negative probabilities actually crop up in exactly this way in quantum physics! This isn’t particularly obvious in the standard formulation learned in most undergrad courses, but the theory can also be written in a different way that closely resembles classical statistical mechanics. However, unlike the classical case, the resulting ‘distribution’ is not a normal probability distribution, but a quasiprobability distribution that can also take negative values.

As with Feynman’s apples, these negative values don’t map to anything we observe directly: all measurements we could make give results that occur with zero or positive probabilities, as you would expect. The negative probabilities instead come in as intermediate steps in the calculation.

This should become clearer when I work through a toy example. The particular example I’ll use (which I got from an excellent blog post by Dan Piponi) doesn’t come up in quantum physics, but it’s very close: its main advantage is that the numbers are a bit simpler, so it’s easier to concentrate on the ideas. I’ll do this in two pieces: one that requires no particular physics or maths background and just walks through the example using basic arithmetic, and one that makes connections back to the quantum mechanics literature and might drop in a Pauli matrix or two. This is the no-maths one.

Neither of these routes really gets to the point of fully making sense of negative probabilities. In the apple example, we have a tool for making calculations easier, but we also have an interpretation of ‘a negative apple’, in terms of taking away one of the apples you have already. For negative probabilities, we mostly just have the calculational tool. It’s tempting to try and follow the apple analogy and interpret negative probabilities as being to do with something like ‘events unhappening’ – many people have suggested this (see e.g. Michael Nielsen here), and I certainly share the intuition that something like this ought to be possible, but I’ve never seen anything fully worked out along those lines that I’ve found really satisfying.

In the absence of a compelling intuitive explanation, I find it helpful to work through examples and get an idea of how they work. Even if we don’t end up with a good explanation for what negative probabilities are, we can see what they do, and start to build up a better understanding of them that way.

A strange machine

OK, so let’s go through Piponi’s example (here’s the link again). He describes it very clearly and concisely in the post, so it might be a good idea to just switch to reading that first, but for completeness I’ll also reproduce it here.

Piponi asks us to consider a case where:

a machine produces boxes with (ordered) pairs of bits in them, each bit viewable through its own door.

So you could have 0 in both boxes, 0 in the first and 1 in the second, and so on. Now suppose we ask the following three questions about the boxes:

  1. Is the first box in state 0?
  2. Is the second box in state 0?
  3. Are the boxes both in the same state?

I’ll work through two possible sets of answers to these questions: one consistent and unobjectionable set, and one inconsistent and stupid one.

Example 1: consistent answers

Let’s say that we find that the answer to the first question is ‘yes’, the answer to the second is ‘no’, and the answer to the third is ‘no’. This makes sense, and we can interpret this easily in terms of an underlying state of the two boxes. The first box is in state 0, the second box is in state 1, and so of course the two are in different states and the answer to the third question is also satisfied.

We can represent this situation with the grid below:

The system is in state ‘first box 0, second box 1’, with probability 1, and the other states have probability 0. This is all very obvious – I’m just labouring the point so I can compare it to the case of inconsistent answers, where things get weird.

Example 2: inconsistent answers

Now suppose we find an inconsistent set of answers when we measure the box: ‘no’ to all three questions. This doesn’t make much intuitive sense: both boxes are in state 1, but also they are in different states. Still, Piponi demonstrates that you can assign something like ‘probabilities’ to the squares on the grid, as long as you’re OK with one of them being negative:

Let’s go through how this matches up with the answers to the questions. For the first question, we have

P(\text{first box 0}) = P(\text{first box 0, second box 0}) + P(\text{first box 0, second box 1})

P(\text{first box 0}) = -\frac{1}{2} + \frac{1}{2} = 0

so the answer is ‘no’ as required. Similarly, for the other two questions we have

P(\text{second box 0}) = P(\text{first box 0, second box 0}) + P(\text{first box 1, second box 0})

P(\text{second box 0}) = -\frac{1}{2} + \frac{1}{2} = 0


P(\text{boxes same}) = P(\text{first box 0, second box 0}) + P(\text{first box 1, second box 1})

P(\text{boxes same})  = -\frac{1}{2} + \frac{1}{2} = 0

so we get ‘no’ to all three, at the expense of having introduced this weird negative probability in one cell of the grid.
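The three sums above are easy to check mechanically. Here is a minimal sketch, with the grid stored as a dictionary keyed by (first bit, second bit); the variable names are mine.

```python
from fractions import Fraction as F

# Piponi's assignment: -1/2 for 'both 0', 1/2 for everything else.
grid = {(0, 0): F(-1, 2), (0, 1): F(1, 2), (1, 0): F(1, 2), (1, 1): F(1, 2)}

p_first_is_0 = grid[(0, 0)] + grid[(0, 1)]
p_second_is_0 = grid[(0, 0)] + grid[(1, 0)]
p_same = grid[(0, 0)] + grid[(1, 1)]

print(p_first_is_0, p_second_is_0, p_same, sum(grid.values()))  # 0 0 0 1
```

All three questions get probability 0 of ‘yes’, and the four cells still sum to 1.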

It’s not obvious at all what the negative probability means, though! Piponi doesn’t explain how he came up with this solution, but I’m guessing it’s one of either ‘solve the equations and get the answer’ or ‘notice that these numbers happen to work’.

I wanted to think a bit more about interpretation, and although I haven’t fully succeeded, I did notice a more enlightening calculation method, which maybe points in a useful direction. I’ll describe it below.

A calculation method

Some motivating intuition: all four possible assignments of bits to boxes are inconsistent with the answers in Example 2, but ‘both bits are zero’ is particularly inconsistent. It’s inconsistent with the answers to all three questions, whereas the other assignments are inconsistent with only one question each (for example, ‘both bits are 1’ matches the answer to the first two questions, but is inconsistent with the two states being different).

So you can maybe think in terms of consecutively answering the three questions and penalising assignments that are inconsistent. ‘Both bits are zero’ is an especially bad answer, so it gets clobbered three times instead of just once, pushing the probability negative.

The method I’ll describe is a more formal version of this. I’ll go through it first for Example 1, with consistent answers, to show it works there.

Back to Example 1

Imagine that we start in a state of complete ignorance. We have no idea what the underlying state is, so we just assign probability ¼ to each cell of the grid, like this:

(I’ll stop drawing the axes every time from this point on.) We then ask the three questions in succession and make corrections. For the first question, ‘is the first box in state 0’, we have the answer ‘yes’, so after we learn this we know that the left two cells of the grid now have probability ½ each, and the right two have probability 0. We can think of this as adding a correction term to our previous state of ignorance:

Notice that the correction term has some negative probabilities in it! But these seem relatively benign from an interpretational point of view – they are just removing probability from some cells so that it can be reassigned to others, and the final answer is still positive. It’s kind of similar to saying P(\text{heads}) = 1 - P(\text{tails}), where we subtract some probability to get to the answer.

Next, we add on two more correction terms, one for each of the remaining two questions. The correction term for the second question needs to remove probability from the bottom row and add it to the top row, and the one for the third question corrects the diagonals:

Adding ‘em all up gives

So the system is definitely in the top left state, which is what we found before. It’s good to verify that the method works on a conventional example like this, where the final probabilities are positive.

Example 2 again

I’ll follow the same method again for Piponi’s example, starting from complete uncertainty and then adding on a correction for each question (this time the answer is ‘no’ each time). This time I’ll do it all in one go:

which adds up to

So we’ve got the same probabilities as Piponi, with the weird negative -½ probability for ‘both in state 0’. This time we get a little bit more insight into where it comes from: it’s picking up a negative correction term from all three questions.
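Written as matrices, in the same layout as before (columns for the first box, rows for the second, state 0 at the left and top; a sketch worked out from the correction rule rather than a new result), the sum for the three ‘no’ answers is:

```latex
\frac{1}{4}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}
+ \frac{1}{4}\begin{pmatrix} -1 & 1 \\ -1 & 1 \end{pmatrix}
+ \frac{1}{4}\begin{pmatrix} -1 & -1 \\ 1 & 1 \end{pmatrix}
+ \frac{1}{4}\begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}
= \begin{pmatrix} -\tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}
```

The top left cell is the only one that picks up a minus sign from every correction term, and the four entries still sum to 1.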


This ‘strange machine’ looks pretty bizarre. But it’s extremely similar to a situation that actually comes up in quantum physics. I’ll go into the details in the follow-up post (‘now with added equations!’), but this example almost replicates the quasiprobability distribution for a qubit, one of the simplest systems in quantum physics. The main difference is that Piponi’s machine is slightly ‘worse’ than quantum physics, in that the -½ value is more negative than anything you get there.

The two examples I did were ones where all three questions have definite yes/no answers, but my method of starting from a state of ignorance and adding on corrections carries over in the obvious way when you have a probability distribution over ‘yes’ and ‘no’. As an example, say you have a 0.8 probability of ‘no’ for the first question. Then you add 0.8 times the correction matrix for ‘no’, with the negative probabilities on the left hand side, and 0.2 times the correction matrix for ‘yes’, with the negative probabilities on the right hand side. That’s all there is to it. Just to spell it out I’ll add the general formula: if the three questions have answer ‘no’ with probabilities q_1, q_2, q_3 respectively, then we assign probabilities to the cells as follows:
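One way to write this, as a sketch derived from the correction procedure above: the labels a and b for the states of the first and second box (each 0 or 1) are introduced here for convenience, and the third question is taken to be ‘are the two boxes in the same state’.

```latex
W(a, b) = \frac{1}{4}\Big[\, 1
  - (-1)^{a}\,(2q_1 - 1)
  - (-1)^{b}\,(2q_2 - 1)
  - (-1)^{a+b}\,(2q_3 - 1) \Big],
\qquad a, b \in \{0, 1\}
```

Setting q_1 = q_2 = q_3 = 1 gives W(0, 0) = -½ and ½ for the other three cells, as in Example 2, while setting them all to 0 gives W(0, 0) = 1, as in Example 1.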

(If you’re wondering where the W comes from, it’s just the usual letter used to label this thing – it stands for ‘Wigner’, and is a discrete version of his Wigner function.)
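Here is a minimal Python sketch of the same recipe, mixing the ‘no’ and ‘yes’ correction terms with weights q and 1 - q for each question. The function names `sign_if_no` and `wigner` are made up for this illustration, and the wording of the second and third questions is assumed; exact fractions are used so that the results can be compared without rounding.

```python
from fractions import Fraction

QUARTER = Fraction(1, 4)


def sign_if_no(question, a, b):
    """Sign of the correction to cell (a, b) when the answer is 'no'.

    Questions (wording of 2 and 3 is assumed):
      1: is the first box in state 0?
      2: is the second box in state 0?
      3: are the two boxes in the same state?
    """
    cell_says_yes = {1: a == 0, 2: b == 0, 3: a == b}[question]
    # A 'no' answer removes probability from cells that would say 'yes'.
    return -1 if cell_says_yes else 1


def wigner(q1, q2, q3):
    """Return {(a, b): W(a, b)} given the probability of 'no' for each question."""
    qs = {1: Fraction(q1), 2: Fraction(q2), 3: Fraction(q3)}
    W = {}
    for a in (0, 1):
        for b in (0, 1):
            total = QUARTER  # state of complete ignorance
            for k, q in qs.items():
                s = sign_if_no(k, a, b)
                # q times the 'no' correction plus (1 - q) times the 'yes' correction
                total += q * s * QUARTER + (1 - q) * (-s) * QUARTER
            W[(a, b)] = total
    return W


print(wigner(0, 0, 0))  # Example 1: every answer is 'yes'
print(wigner(1, 1, 1))  # Example 2: every answer is 'no'
print(wigner(Fraction(4, 5), Fraction(1, 2), Fraction(1, 2)))  # 0.8 'no' on question 1 only
```

In the last call the second and third questions are left at ½, so their corrections cancel and only the first question shifts probability between the left and right columns.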

It turns out that all examples in quantum physics are of the type where you don’t have certain knowledge of the answers to all three questions. It’s possible to know the answer to one of them for certain, but then you have to be completely ignorant about the other two, and assign probability ½ to both answers. More usually, you will have partial information about all three questions, with a constraint that the total information you get about the system is at most half the total possible information, in a specific technical sense. To go into this in detail will require some more maths, which I’ll get to in the next post.

One theory to the tune of another

My second favourite type of question in physics, after ‘what’s the simplest non-trivial example of this thing?’, is probably ‘how can I write these two things in the same formalism, so that the differences stand out more clearly?’

This may look like an odd choice, given that all I ever do here is grumble about how crap I am at picking up new formal techniques. But actually that’s part of why I like it!

Writing two theories in the same language is like putting two similar transparencies on top of each other, and holding them up to the light. Suddenly the genuine conceptual differences pop out visibly, freed from the distraction of all the tedious extraneous machinery that surrounds them.

Or at least that’s always the hope – it’s actually pretty hard work to do this.

There are two maps between classical and quantum physics that I’m interested in learning, and should probably have included in my crackpot grand plan. (I guess they can be shoved into the quantum foundations grab bag.)

One is the phase space reformulation of quantum mechanics. This is sort of a standard technique, but I still managed to avoid hearing about it until quite recently. Some subfields apparently use it a lot, but you’re unlikely to see it in any standard quantum course. It also has a weird lack of decent introductory texts. I met someone at the workshop I went to who uses it in their research and asked what I should read, and he just looked pained and said ‘My thesis, maybe? When I write it?’ So learning it may not be especially fun.

It looks really interesting though! You can dump all the operators and use something that looks very like a normal probability distribution, so the parallels with classical statistical mechanics are much more explicit. There are obviously differences – this distribution can be negative, for a start. (It’s known as a quasidistribution.) Ideally, I’d like to be able to hold them both up to the light and see exactly where all the differences are.

It’s less well known that you can also do classical mechanics on Hilbert space! It’s called Koopman–von Neumann theory. If you ever thought ‘what classical mechanics is really missing is a load of complex wavefunctions on configuration space’, then this is the formalism for you.

In this case, I ought to be luckier with the notes, because Frank Wilczek wrote some a couple of years ago.

I’m not so clear on exactly what this thing is and what I’d get out of learning it, but the novelty value of a Born rule in classical mechanics is high enough that I can’t resist giving it a go. And I’d have a new pair of formalisms to hold up to the light.