Sunday, September 13, 2020

Minimizing Bias in Experimental Particle Physics


The experiments at the Large Hadron Collider at CERN produce a total of about 90 petabytes of data per year. A relatively high-end desktop computer today may have a disk that can store about 1 terabyte of data, which means it takes the disk space of about 100,000 desktop computers to store the annual data collected by the CERN LHC experiments. The physics researchers who work at CERN use powerful computers to analyze this data. The goal of our analysis is always to learn more about the fundamental structure of the universe. We are particularly trying to probe aspects of nature that we do not yet fully understand and discover something that we do not yet know.
 
In very general terms, the analysis of our data can be divided into two broad categories. The first could be labeled "measurements," where we try to measure more precisely the value of some property of nature that we have already observed and for which we have a mathematical model predicting what the measurement should reveal. If the measurement differs from expectations, then we have indirectly discovered something about nature previously unknown. The second category could be labeled "searches," where we search directly for a new, undiscovered particle or phenomenon. Often we are searching for something predicted by a proposed model of physics developed by a theoretical physicist.

When looking for new phenomena, it is vital that we not introduce presuppositions or bias into our experiments. It is well documented that human bias can subconsciously skew experimental analysis. In my last blog post I mentioned that "we go to great lengths to minimize any bias toward one conclusion or another, particularly when looking for some unknown phenomena or particle." A reader of my blog, Keith, asked a number of questions about how we minimize bias in our experiments. The lengths we go to, and the methods we use, are quite informative and can apply to other arenas in life where we want to reach objective conclusions.

At the LHC, two different bunches of protons, each containing about 100,000,000,000 protons, cross paths about 40 million times per second, resulting in about 50 proton collisions each time the bunches cross. If some process occurs frequently in nature, then we have usually already studied it in depth and understand it. Consequently, the new phenomena we are looking for tend to be very rare. For instance, the Higgs Boson, which was discovered in 2012 at CERN, is produced only once for every ten billion proton collisions. It is difficult to find something new in the data that is very rare (a "signal") because the more common processes can produce similar effects in our detectors, creating "background" events which mimic the signal we are looking for. One of my colleagues says it's harder than finding a needle in a haystack; it's more like finding one particular piece of hay in the entire haystack.

In order to find such a rare process in the data, we develop selection criteria, or "cuts," which tend to enhance the signal being searched for and minimize the background. As a very imperfect example, suppose you were trying to find a certain new type of exotic sports car among all the cars driving on a certain road. Keep in mind, you don't even know whether such a sports car actually exists. You could make a "cut" like only looking for cars that can reach a speed of at least 200 miles per hour. But if you didn't see any such cars, you might introduce a bias by redefining that cut and, instead, looking for cars that can reach 120 miles per hour. With this looser cut, you are suddenly able to "find" a whole bunch of exotic sports cars. Your bias has manipulated your cuts until you found just what you wanted to find.
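The effect of loosening a cut can be sketched in a few lines of toy Python. Everything here is invented for illustration (the distributions, the "top speed" variable, and the numbers have nothing to do with real LHC software): ordinary cars play the role of background, and a handful of rare fast cars play the role of signal.

```python
import random

random.seed(42)

# Toy illustration only: each "event" is reduced to one discriminating
# variable, a top speed in mph. Ordinary cars (background) cluster
# around 100 mph; the rare hypothetical exotic car (signal) clusters
# around 190 mph.
background = [random.gauss(100, 20) for _ in range(100_000)]
signal = [random.gauss(190, 15) for _ in range(10)]

def count_passing(events, cut):
    """Count events whose discriminating variable exceeds the cut."""
    return sum(1 for x in events if x > cut)

for cut in (160, 120):
    s = count_passing(signal, cut)
    b = count_passing(background, cut)
    print(f"cut > {cut} mph: signal kept = {s}, background kept = {b}")
```

With the tight cut, almost no ordinary cars survive; lowering the cut to 120 mph floods the selection with thousands of ordinary cars that look like "discoveries," which is exactly why a cut must not be retuned after looking at the data.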

To avoid this, we develop our cuts without actually looking at the pertinent data. We first investigate computer-simulated data, which contains all the information we know about how nature works and about what the undiscovered phenomenon we seek might look like. We tune our cuts on this simulated data without looking at the real data. In the example above, it's as if we wrote a computer simulation that defines what our new exotic sports car might look like if it were driving on a road with all the other cars we already know about. Since we know there are many ordinary cars that can go 120 miles per hour, this simulation would keep us from lowering our speed cut to such a low value.

Once we have developed our cuts on this simulation, we then look at a subset of the real data to confirm that our simulation is valid. Still, to avoid bias, we first look at the data only in a region where we don't expect to see the new phenomena, what we call a "control" region. In the sports car analogy, that would be like looking at a different road on which we know only familiar types of cars drive (not the new exotic sports car) and comparing our computer simulation of the cars on that road with the actual road. Only after we have confirmed that our simulation works on that different road, and have tuned the cuts on the simulation, do we look at the road we are actually interested in to see if the new car exists, all without changing the selection cuts.
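The blind-analysis workflow described above can be sketched schematically. All function names, the simple s/√(s+b) figure of merit, and the 10% agreement tolerance are hypothetical simplifications for illustration; real experiments use far more elaborate statistical machinery.

```python
import math

def tune_cut(sim_signal, sim_background, candidate_cuts):
    """Choose the cut maximizing a simple s/sqrt(s+b) figure of merit,
    using simulated events only; the real data is never consulted."""
    def merit(cut):
        s = sum(1 for x in sim_signal if x > cut)
        b = sum(1 for x in sim_background if x > cut)
        return s / math.sqrt(s + b) if s + b > 0 else 0.0
    return max(candidate_cuts, key=merit)

def simulation_validated(sim_control, data_control, tolerance=0.10):
    """Compare simulated and observed event counts in a control region
    where no signal is expected; agreement validates the simulation."""
    expected, observed = len(sim_control), len(data_control)
    return abs(observed - expected) / expected <= tolerance

def open_the_box(data_signal_region, cut, validated):
    """Unblind the signal region only after the cut is frozen on
    simulation and the simulation has passed the control-region check."""
    if not validated:
        raise RuntimeError("cannot unblind: simulation not validated")
    return [x for x in data_signal_region if x > cut]
```

The key design point is the ordering: `tune_cut` and `simulation_validated` touch only simulation and control-region data, and `open_the_box` refuses to run until the validation flag is set.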

As a real example, about two years ago I was involved in looking for the Higgs Boson decaying to two W particles, our "signal." However, another "background" process, the decay of two top quarks, can sometimes look very much like our signal, with one major difference: the top quark also produces what are called "b-jets." So one of our selection criteria meant to find Higgs Bosons requires that there be no identified b-jets in our events. (An "event" is a collision of two protons that produces particles in our detector.) However, events that do have b-jets are very useful as a control region for verifying that our computer simulation is correct: they look almost exactly like the signal we want to find, but we know they are not our signal because they contain b-jets. Consequently, events with identified b-jets serve as a perfect control region to make sure we fully understand our data without biasing the subset of the data that may contain the actual signal we are searching for.
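The way the b-jet requirement partitions the data can be sketched with a toy event record. The dict layout and the field name `n_bjets` are invented for illustration; a real event carries vastly more detector information.

```python
# Toy events: each one is reduced to an id and a count of identified
# b-jets (invented format, for illustration only).
events = [
    {"id": 1, "n_bjets": 0},
    {"id": 2, "n_bjets": 2},
    {"id": 3, "n_bjets": 0},
    {"id": 4, "n_bjets": 1},
]

# The b-jet veto defines the (blinded) signal region; events with one
# or more b-jets form the top-quark control region used to check the
# simulation against real data.
signal_region = [e for e in events if e["n_bjets"] == 0]
control_region = [e for e in events if e["n_bjets"] >= 1]
```

Note that the two selections are mutually exclusive, so studying the control region in detail never exposes the events that might contain the signal.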

To answer one of Keith's questions: all of the work described above is usually done by the same small group of analysis experts, usually 2 to 5 people within our collaboration of about 3500 physicists. But before permission is given to look at the subset of real data that might contain a new discovery (what we call "opening the box" or, in the car analogy above, looking at the road of interest), an independent group of experts within the collaboration examines all the work done, including the computer simulations and the comparisons with the control regions. Only then are the analyzers permitted to look at the real data in the region where the new phenomena may be seen. Bias is avoided because everything is developed and tested without looking at the data which might contain a new discovery. When all techniques and processes are developed independently of the relevant data, in a way that does not presuppose whether a real signal will be found, bias is averted.

The point of this detailed description is to emphasize that in my professional work I spend a great deal of effort developing techniques that allow me to analyze data without presupposed ideas or bias.

Knowing that this blog is devoted to the intersection of science and Christianity, Keith asked pertinent questions about bias that may be introduced by historians when they try to interpret data surrounding the written accounts about the life of Jesus. He asks, "In historical document analysis do you believe the interpretation by many qualified scholars from both biblical and secular scholars from appropriate disciplines should be included in reaching a consensus opinion about the reliability of the Gospel accounts of the resurrection? Are these independent scholarly studies in some sense like independent scientific teams interpreting the results of a given experiments data set?" 

In the next blog post, I will try to answer these questions and discuss how bias and preconceived notions may influence the conclusion of historians and others as they examine the historical evidence surrounding the life of Jesus of Nazareth.


Figures above are taken from "Measurements of gluon–gluon fusion and vector-boson fusion Higgs boson production cross-sections in the H → WW → eνμν decay channel in pp collisions at √s = 13 TeV with the ATLAS detector," published in Physics Letters B, Volume 789, 10 February 2019, pages 508–529. The figure at the top shows the transverse mass distribution with the fitted Higgs Boson signal in red, and the figure at the bottom shows the zero-jet top control region. For further details, see the referenced paper.

1 comment:

  1. Thanks Michael, This was well written and the car analogy was helpful to us lay persons. I imagine the data reduction techniques used to maximize the S/N ratio, filter out noise, etc. are phenomenal in your world. In my OR days (I'll date myself) we were implementing Kalman filtering digitally as an upgrade from analog Wiener techniques. I am eagerly awaiting your application to the historicity of the Life of Our Lord and Savior Christ, particularly regarding the resurrection.
