A question about the GCP

#1
I'm looking for a concise explanation as to why we are not able to look at incoming data at GCP and predict major events based on the same kind of data that showed up during other similar events. Or perhaps they've done this already?

I have very little understanding of statistics, and as a layperson I'm very curious about this. Thanks in advance.
 
#3
I still haven't heard an explanation for how all these bits of non-randomness can be showing up throughout the database while the database as a whole is right at expectation.

Regular non-random strings in a huge sample like the GCP should, as I understand it, result in an overall database that is many standard deviations off from expectation.
 

Paul C. Anagnostopoulos

Nap, interrupted.
Member
#4
I believe you can have windows where the Z-score deviates from the null without skewing the overall database. I may be wrong. But how are they measuring the standardness of the database?

Search for your favorite number in the digits of pi. '111111111' appears first at position 812,432,526.

http://www.subidiom.com/pi/

Note that whether pi is normal (roughly, whether every digit string shows up as often as it would in random digits) is an outstanding problem in mathematics.

~~ Paul
 
#5
Here's my basic understanding: if the non-random string happens once, or not very often, then sure, it's going to get eaten up by variance. But if what we are talking about is regular insertions of non-random elements into the stream, which is what I understand is being alleged in the GCP, then it is going to show up in the overall results. My understanding of this comes from discussions about online poker involving some pretty knowledgeable stats guys. I could probably dig up some very old posts if anyone is interested enough, where one poster demonstrated mathematically how even small but regular insertions of non-random strings would eventually result in large deviations from expectation over large samples (and you don't get much larger in terms of samples than the entire GCP stream!)
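
The arithmetic behind that argument can be sketched in a few lines. This is only an illustration under stated assumptions: the stream is treated as fair bits, and the per-bit bias `eps` is a made-up number, not anything measured by the GCP. If every bit is 1 with probability 0.5 + eps, the expected Z-score of the running total after n bits is 2 * eps * sqrt(n), so it keeps growing as the sample grows.

```python
import math

def expected_z(n_bits, eps):
    """Expected Z-score of the bit total when each bit is 1 with probability 0.5 + eps."""
    excess = n_bits * eps            # expected surplus of ones over n_bits / 2
    null_sd = math.sqrt(n_bits) / 2  # standard deviation of the total for fair bits
    return excess / null_sd          # simplifies to 2 * eps * sqrt(n_bits)

if __name__ == "__main__":
    eps = 0.001  # hypothetical bias: 50.1% ones instead of 50%
    for n in (10**4, 10**6, 10**8, 10**10):
        print(f"n = {n:>14,}  expected Z = {expected_z(n, eps):8.1f}")
```

Note the assumption that the bias always pushes in the same direction. Insertions that pushed up and down equally often could leave the total near expectation, and would only show up in a variance-type measure.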
 

Paul C. Anagnostopoulos

#6
Doesn't this depend on how the measure of nonrandomness for substrings interacts with the overall measure?

Let's take an example. Suppose we have 2000 samples. Now suppose the first 1000 samples are nonrandom, but the first 1,500, taken together, are random. Isn't the overall measure then random?

What do I know? We need a statistician. Oh, but how do we know that the overall sequence is statistically random?

~~ Paul
 
#7
I'm not sure I followed your example, but the point as I understand it is that as long as the insertions of non-randomness occur often enough, even if tiny, the overall database will be many standard deviations from expectation over a large sample. They actually block that poker website here at work, but I'll try to remember when I get home to search for the post I'm talking about. It was several years ago but hopefully I can find it.
 

Paul C. Anagnostopoulos

#8
Assume we have 2000 samples in the entire database. We find an anomaly in the first 1000 samples, so they are not random. But when we consider the first 1500 samples, they are random. There is nothing special about the final 500 samples. So isn't the overall database random? Perhaps not.

~~ Paul
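
The 2000-sample example can be put into numbers. A minimal sketch, assuming the samples are fair bits and using made-up counts of ones for three consecutive blocks (none of this is GCP data): the first 1000 samples are significant on their own, yet the first 1500 and the full 2000 are not.

```python
import math

def z_score(ones, n_bits):
    """Z-score of a count of ones among n_bits fair bits."""
    return (ones - n_bits / 2) / (math.sqrt(n_bits) / 2)

# Hypothetical (ones, block size) for three consecutive blocks, invented for illustration.
blocks = [(540, 1000), (230, 500), (250, 500)]

def prefix_z_scores(blocks):
    """Return (n_bits, z) for each growing prefix of the blocks."""
    results, ones, n_bits = [], 0, 0
    for block_ones, block_bits in blocks:
        ones += block_ones
        n_bits += block_bits
        results.append((n_bits, z_score(ones, n_bits)))
    return results

if __name__ == "__main__":
    for n_bits, z in prefix_z_scores(blocks):
        print(f"first {n_bits} samples: Z = {z:.2f}")
```

With these counts the prefixes come out at about Z = 2.53, 1.03 and 0.89. A single anomalous stretch gets diluted as more ordinary data is added, which is a different situation from a bias that is present throughout.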
 
#9
You seem to be talking about a one off in the first 1000? Or non-random elements in the first 1000 but not after that? I'm not sure whether that would necessarily affect the entire database or not, but what I'm referring to are regular insertions of non-randomness throughout the entire database.
 

Paul C. Anagnostopoulos

#11
What is the difference between one and many? And how do we count overlapping nonrandom sequences?

I should think it would have something to do with the percentage of the database that is nonrandom. But then I don't know what to do about overlapping sequences. Consider looking at subsequences of pi and calculating the mean of the digits. The expectation is 4.5. How many subsequences can we find that are significantly different from 4.5?

~~ Paul
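
The pi question can be tried directly. A rough sketch, with several assumptions: the digits are generated with the unbounded spigot algorithm usually attributed to Gibbons, the window width of 100 and the |Z| > 1.96 cutoff are arbitrary choices, and each window is tested as if its digits were independent and uniform on 0..9. Overlapping windows share digits, so the hits are not independent of each other.

```python
import math
from itertools import islice

def pi_digits():
    """Yield decimal digits of pi (3, 1, 4, 1, 5, ...) using an unbounded spigot."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

DIGIT_MEAN = 4.5
DIGIT_SD = math.sqrt(8.25)  # standard deviation of a uniform digit 0..9

def count_significant_windows(digits, width, z_cut=1.96):
    """Count overlapping windows whose mean digit has |Z| > z_cut; return (hits, windows)."""
    hits = 0
    window_sum = sum(digits[:width])
    n_windows = len(digits) - width + 1
    for start in range(n_windows):
        if start > 0:  # slide the window one digit to the right
            window_sum += digits[start + width - 1] - digits[start - 1]
        z = (window_sum / width - DIGIT_MEAN) / (DIGIT_SD / math.sqrt(width))
        if abs(z) > z_cut:
            hits += 1
    return hits, n_windows

if __name__ == "__main__":
    digits = list(islice(pi_digits(), 2000))
    hits, n_windows = count_significant_windows(digits, width=100)
    print(f"{hits} of {n_windows} windows have |Z| > 1.96")
```

If the digits behave like random digits, something on the order of 5% of windows should cross the cutoff by chance alone, even though nobody would call the overall sequence anomalous.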
 
#12
I'm looking for a concise explanation as to why we are not able to look at incoming data at GCP and predict major events based on the same kind of data that showed up during other similar events. Or perhaps they've done this already?

I have very little understanding of statistics, and as a layperson I'm very curious about this. Thanks in advance.
Prediction is a more robust way of demonstrating an effect.

I think the researchers have found that one cannot specify beforehand which range of eggs (the RNG devices) will be affected, which time period, or what form the non-randomness will take, except for strictly recurring events. The patterns found, for which windows, and for which eggs, seem to vary for Burning Man, for example.

Of course, this unfortunately gives the appearance that the patterns found are the result of the post hoc pattern search (as described in the link which Paul gave).

It would make sense to focus on the New Year's data, as it wouldn't be as subject to this problem (assuming there is never any change in which eggs and which window are analyzed for which pattern). Has this been published?

Hmmm...I found this.

http://noosphere.princeton.edu/newyear.2014.html

Doesn't look at all promising for prediction.

Linda
 
#13
I guess we probably should stay out of it. I think you're asking, how would Radin and Nelson explain the inability to predict, not how I (or Paul or Arouet) would.

Linda
 
#14
Well, like I said I'm not a statistician. I appreciate any feedback in this discussion, but I'm still having a hard time understanding it.

I'm not asking why Radin and Nelson are unable to explain an inability, because I don't yet know that that's the reality. Have they been unable to make predictions?

Let's work off of the notion that the GCP is working as they claim it does. Is there some way in which it is possible that relevant data is only discoverable after an important event? Do they have to go looking for it? And when they do, what are they looking for?
 
#15
Let's take a step back for a minute. Say a big event happens, like a tsunami or WTC. What happens then? They use a protocol to go back to the data and locate statistically significant anomalies?
 

Paul C. Anagnostopoulos

#16
Yes. Part of the problem is that they can vary the position and size of the analysis window to find deviations from random. How much they do that is not clear to me.

~~ Paul
 
#18
The only barrier I can work out logically to locating and pinpointing spikes that could act as predictive measurements would be time frame. For example, if it takes 2 weeks to properly analyze the data and locate a spike then by the time you've found it the event will have already happened.

In general I have no problem with GCP working backward by finding the data only after an event. That doesn't make or break the project for me. But I'm a bit puzzled that no one seems to understand why. If the GCP could ramp it up to the next level it would have tremendous practical value, such as acting as an early warning system.
 

Paul C. Anagnostopoulos

#19
I have a problem with post hoc analysis. At least the window position and size should be specified before checking the data.

~~ Paul
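
Why the window should be fixed in advance can be shown with a small simulation. This is a sketch on purely random bits, not a model of the actual GCP protocol: the stream length, the window widths, the trial count and the |Z| > 1.96 cutoff are all arbitrary assumptions. It compares how often a single pre-specified window looks significant with how often at least one window does when position and size are chosen after looking.

```python
import math
import random

def best_window_z(bits, widths):
    """Largest |Z| over every window position for each width in widths."""
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + b)  # running count of ones
    best = 0.0
    for w in widths:
        sd = math.sqrt(w) / 2  # standard deviation of the ones count for fair bits
        for start in range(len(bits) - w + 1):
            ones = prefix[start + w] - prefix[start]
            best = max(best, abs(ones - w / 2) / sd)
    return best

def false_alarm_rate(trials, n_bits, widths, z_cut=1.96, seed=0):
    """Fraction of purely random streams in which some searched window has |Z| > z_cut."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(trials):
        bits = [rng.getrandbits(1) for _ in range(n_bits)]
        if best_window_z(bits, widths) > z_cut:
            alarms += 1
    return alarms / trials

if __name__ == "__main__":
    fixed = false_alarm_rate(200, 1000, widths=[1000])
    searched = false_alarm_rate(200, 1000, widths=[50, 100, 200, 500, 1000])
    print(f"one pre-specified window:   {fixed:.2f}")
    print(f"window chosen after a look: {searched:.2f}")
```

The pre-specified window should give a false alarm rate somewhere near 5%. The searched version can only be higher, because it includes the same full-length window plus many others on the same streams.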
 
#20
The duration of the events has varied from one minute to 9 days. They list 7 different analysis recipes (which analysis will be applied to the data within the event period), although they say that "network variance" is mostly used. The data can come from individual EGGs or from groups of EGGs, and it can be taken in one-second blocks or in blocks of varying lengths (minutes, hours). All of those choices go into creating an "event statistic" which can be tested for significance.

So in terms of prediction, without knowing how to narrow it down, you have to calculate an event statistic for every one of those possibilities for each second of data. Each second of data will therefore generate thousands of "event statistics". Even if we look at some of the relatively improbable event statistics which have been generated going the other direction (waiting for an event and then coming up with some parameters), there are a handful at a p-value of less than 0.01. If we used this as our threshold, we'd identify dozens of "events" every second. If we used a much stricter threshold (a much smaller p-value), then we'd miss the events which have already been identified. That's why you can't use it for prediction: the event statistic has almost no sensitivity or specificity.

Linda
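
The "dozens of events every second" figure is simple arithmetic. A minimal sketch, assuming a hypothetical 3000 event statistics per second (the post only says "thousands") and that each one has a 1% chance of falling below p = 0.01 when nothing is going on:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def false_alarms_per_second(stats_per_second, p_threshold):
    """Expected count of null event statistics per second with p below the threshold."""
    return stats_per_second * p_threshold

def threshold_for_daily_alarms(stats_per_second, alarms_per_day):
    """p-value cutoff that would leave about alarms_per_day chance alarms per day."""
    return alarms_per_day / (stats_per_second * SECONDS_PER_DAY)

if __name__ == "__main__":
    stats_per_second = 3000  # hypothetical stand-in for "thousands"
    per_second = false_alarms_per_second(stats_per_second, 0.01)
    print(f"chance alarms per second at p < 0.01: {per_second:.0f}")
    print(f"chance alarms per day at p < 0.01:    {per_second * SECONDS_PER_DAY:,.0f}")
    cutoff = threshold_for_daily_alarms(stats_per_second, 1)
    print(f"cutoff for about one chance alarm per day: p < {cutoff:.1e}")
```

Under these assumptions the cutoff needed to get down to about one chance alarm per day is far smaller than the p-values reported for the identified events, which is the trade-off described above.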
 