-
With this mutation rate, what is the average amount of time it will take for a mutation to occur?
days
If we let $T$ be the time until the mutation, we can call $T$ the waiting time. The expected (mean) waiting time, is therefore $_$ days.
-
To begin, let's think about modeling this scenario in discrete time, where we let the time unit be a day. We can perform an experiment of this scenario and see what outcome we get.
Starting with day 1, we flip a coin to determine if there was a mutation. This coin will come out Heads with probability 0.005. If we get a Head, we record what day we get the mutation. If we get a Tail, we continue to the next day. We continuing marching through days until we finally get a Head. The outcome of our experiment is the time $T$, which is the day where we finally got a Head (the cell got its cancerous mutation), and we stopped the experiment.
We repeat this experiment over and over again, recording each outcome. If we repeated long enough and then took an average of all the values of $T$, this average waiting time should be close to what number?
.
If, instead of taking the average, we instead recorded how many times the mutation occurred on day 1, how many times it occurred on day 2, how many times it occurred on day 3, etc., on which day do you think you would have recorded the largest number of mutations? The largest number of mutations would occur on day $T=$
.
-
We could determine the answer for the previous question if we knew the probability distribution of the random variable $T$. We know some properties of the probability distribution, since we know the mean value of $T$, we know $T$ can take on positive integer values, and we know that it would be unlikely for $T$ to take on extremely large values, such as $10^{10}$.
The probability distribution (or probability mass function) for $T$ is a function $f_T(t)$ of day $t$ that gives the probability that the waiting time will be $t$ days:
$$P(T=t) = f_T(t).$$
Before calculating a formula for the probability distribution, let's explore what it looks like by doing many such experiments with flipping coins and finding the first Head. Since it may be difficult to find a coin that has a probability of 0.005 of coming up Heads, we can use a computer program to approximate such a random toss. In R, we can use the function rbinom to flip the coin. A call to rbinom(1,1,0.005) will return a 1 with probability 0.005 and return a 0 with probability 1-0.005. All we need to do is create an infinite loop (by using while(TRUE)) and keep flipping coins by calling rbinom inside the loop. If we happen to get a 1 from rbinom, we got our Heads and the cell mutated. We record the number of flips (which stands for the number of days), which is our outcome $T$.
The following R code will perform such an experiment:
lambda=0.005
day = 0
while(TRUE) { # repeat forever until break
day=day+1
# flip a coin that will be 1 with probability lambda
if (rbinom(1,1,lambda)==1) {
# a mutation occured
# break out of loop
break
}
}
When the loop exits, the variable day will hold the value of the outcome $T$.
-
Since we want to run a bunch of experiments, we might as well create another loop to run the above code over and over again. If we let the variable n_sample control how many experiments we will run, then we can create a vector of all zeros with rep.int(0,n_samples) and store the value of day in that vector at the end of the experiment.
If you add this code around the previous code, you'll get the results of 10 experiments in the vector days.
n_samples=10
lambda=0.005
# initialize days to be a vector of all zeros
days=rep.int(0,n_samples)
for(i in 1:n_samples) {
day = 0
while(TRUE) {
[flip coins until get a HEAD and record number of flips in the variable day]
}
# record day of mutation
days[i] = day
}
After running the code, you can check the mean waiting time with mean(days). This number should fluctuate around the mean value you calculated if you run the code multiple times.
Hint
The above R code won't run as it is written. To save writing, the code inside the while-loop is omitted. Copy the code from the while-loop of the previous part (the code inside the curly braces after while(TRUE)) and paste it to replace the text inside the square braces. If you get confused, you can always cheat and look at the code in the hint of the next part.
Hide help
-
Lastly, you can use the hist command to plot a histogram of the outcomes of the experiments. We'll pass the argument breaks=0:max(days) so that each bin of the histogram will be only one day wide (making a ragged looking histogram with thin bars because we don't want it to average over larger time bins). We'll give it the probability=TRUE argument sot that it will rescale the height of each bar so that the sum of all the heights is 1. In this case, the height of each bar will be the number of mutations that occurred in that day, divided by the total number of experiment, i.e., an estimate of the probability that $T$ was equal to that day.
We'll just show the first 1000 days of the histogram with the argument xlim=c(0,1000) so that the occasional really long experiment won't make the histogram scale even smaller and make the small day part of it harder to read. You can remove that argument to see the whole histogram. The full histogram command is the following.
days_show=1000
hist(days,breaks=0:max(days),xlim=c(0,days_show),
probability=TRUE, xlab="days", ylab="Probability per day",
main="Time to mutation")
When you run 10 experiments, what does the histogram look like?
If you repeat the set of experiments by running the code again and plotting a new histogram, what do you notice about the different versions of the histogram?
Do the results reveal which day is mostly likely for the mutation?
Hint
Here's the complete code to run the experiment 10 times and plot the result in a histogram.
n_samples=10
lambda=0.005
days_show = 1000
# initialize days to be a vector of all zeros
days=rep.int(0,n_samples)
for(i in 1:n_samples) {
day = 0
while(TRUE) { # repeat forever until break
day=day+1
# flip a coin that will be 1 with probability lambda
if (rbinom(1,1,lambda)==1) {
# a mutation occured
# break out of loop
break
}
}
# record day of mutation
days[i] = day
}
cat("average day of mutation:", mean(days))
hist(days,breaks=0:max(days),xlim=c(0,days_show),
probability=TRUE, xlab="days", ylab="Probability per day",
main="Time to mutation")
Hide help
-
To get a better idea of what the probability distribution $f_T(t)$ looks like, we need to take more samples. You can simply increase the variable n_samples to increase the number of experiments to run before computing the histogram.
Increase n_samples to 100 and run the code to view the histogram of outcomes from 100 experiments. Increase n_samples even further to 1000 and then 10000. The histogram should start to fill in and reveal which days are most likely for the mutation. Which days have more mutations?
.
Even with 10,000 samples, the histogram will be too ragged to see precisely which day is most likely. (You'd more than a million samples, but that would take too long to simulate given the way in which we are sampling.) But, can you infer from the trend which day should be mostly likely for the mutation? It appears that the largest number of mutations would occur on day $T=$
.
-
Let's now analyze the mutation process to determine the actual probability distribution $f_T(t)$. The first number is easy. What is the probability that there is a mutation on day 1? $P(T=1) = f_T(1) =$
.
What's the probability we didn't have a mutation on day 1 (i.e., the probability that $T>1$)? $P(T > 1) =$
If we didn't have a mutation on day 1, we move onto day 2.
Given that that there was no mutation on day 1 (i.e., conditioned on $T>1$), what is the probability that there is a mutation on day 2 (i.e., the probability that $T=2$)? The second coin flip has the same probability of Heads as the first, so the probability of $T=2$ conditioned on the fact that $T>1$ is $P(T=2 \,|\, T>1) = $
We, however, are interested in the overall probability that there was a mutation on day 2, the probability $P(T=2)$. $P(T=2)$ is the probability that (a) there was no mutation on day 1 and (b) there was subsequently a mutation on day 2. This probability is the product of $P(T > 1)$ times $P(T=2 \,|\, T>1)$. The probability that the mutation occurred on day 2 is $P(T=2) = f_T(2)=$
.
What is more likely: that the mutation occurred on day 1 or that the mutation occurred on day 2?
-
What is the probability that the mutation did not occur on day 1 or on day 2 (i.e., that you flipped Tails twice in a row so that $T > 2$)? $P(T>2) = $
(Include at least 5 significant digits in all your responses.) Given that $T > 2$, what's the probability that $T=3$? $P(T=3 \,|, T>2) =$
Therefore, what is the probability that the mutation occurred on day 3?
$P(T=3) = P(T=3 \,|, T>2)P(T > 2)=$
What is more likely: that the mutation occurred on day 2 or that the mutation occurred on day 3?
-
In the above answers, you were using the fact that the probability of Heads is $\lambda = 0.005$. As we continue to develop the formula for the probability mass function (the probability distribution) $P(T=t) = f_T(t)$, the pattern will be easier to see, and the formula will be more useful later, if we use the symbol $\lambda$ rather than substituting its numerical value. Rewrite the above formulas using the symbol $\lambda$.
$P(T=1) =$
$P(T>1) =$
$P(T=2) =$
$P(T>2) =$
$P(T=3) =$
Hint
Online, enter $\lambda$ as
lambda or as the symbol λ.
Hide help
-
Can you continue the pattern for any number of days? What's the probability that you flip a Tails $t-1$ times in a row, i.e., what's the probability that $T > t-1$?
$P(T > t-1 )=$
Given that $T > t-1$, what's the probability you get a Heads on the next flip so that $T=t$?
$P(T=t \,|\, T > t-1) = $
Putting it all together, what is the probability of the mutation occurring on day $t$?
$P(T=t) = P(T=t \,|\, T > t-1)P(T > t-1) =$
This last expression is the probability mass function $f_T(t)$. You've determine that
$$P(T=t) = f_T(t) = _.$$
Each day the probability of the mutation gets
because you multiply by one more factor of
which is
than 1.
What's the probability the mutation occurs on day 200? $P(T=200) = f_T(200) =$
(Here we want a numerical answer so substitute $\lambda=0.005$; be sure to include at least five significant digits.)
-
You can add a plot of $f_T(t)$ to your histogram by defining the function
f_T = function(t) {
lambda*(1-lambda)^(t-1)
}
and then adding it to the graph with the command
curve(f_T, from=0, to=days_show, add=TRUE, col= "red", lwd=3)
When you compare the graph of $f_T(t)$ to the histogram with n_samples=10000, the two should be close, though the histogram should include a lot of “noise” with bar heights rising above and dropping below the curve of $f_T(t)$.
-
Given the definition of f_T, you can find the probability of the mutation occurring on any day. For example, you can enter f_T(200) to verify your above answer for $f_T(200)$. You can also calculate the probability of a bunch of days by giving a vector of days for the argument of the function. For example, you can enter f_T(1:20), and the output will be a vector of the probability of mutation for the first 20 days.
In R, the function sum will add up all the components of a vector. If you enter sum(f_T(1:20)), the output will be the probability that $T \le 20$. What is that probability?
$P(T \le 20) = \sum\limits_{t=1}^{20} f_T(t) = $
What is the probability that $T$ is 200 or less?
$P(T \le 200) = \sum\limits_{t=1}^{200} f_T(t) = $
What is the probability that $T$ is any integer from 1 to infinity? This answer is an infinite sum. You won't be able to calculate that directly in R, but if you can calculate enough terms of the sum to get a good idea of what the total should be.
$\sum\limits_{t=1}^{\infty} f_T(t) = $
This result should make sense, because $T$ has to be some positive integer.
-
Now that we have the probability distribution $f_T$, we can also verify the expected value of $T$ (or mean value of $T$). To calculate the expected value, we just need to add up the different values of $T$, multiplied by the probabilities of them occurring, which is $f_T(T)$. That way, values of $T$ that are more likely to occur are multiplied by a larger number. We should get the same result as if we did a really long simulation and then averaged all the different values of $T$ that we got.
The expected value of $T$ is the sum $\sum\limits_{t=1}^\infty t f_T(t)$. Although we won't be able to calculate all the infinite terms in R, we can calculate enough of them so that we'll can tell what the answer is.
Recall that f_T(1:5) is a vector of the values $f_T(1)$ through $f_T(5)$. To multiply $f_T(1)$ by $1$, $f_T(2)$ by $2$, all the way to multiplying $f_T(5)$ by $5$, you can type: f_T(1:5)*(1:5). R multiplies the first component of the first vector by the first component of the second vector, then multiplies their second components, and continues to their last components, calculating the numbers we want. If you type sum(f_T(1:5)*(1:5)), you'll add up those numbers, getting the sum $\sum_{t=1}^5 t f_T(t)$. What is the value of this sum?
Continue increasing 5 to larger numbers until the value stops changing. The result will be the expected value of $T$. What is the expected value (denoted $E(T)$)?
$E(T) = \sum\limits_{t=1}^\infty t f_T(t) = $
This value should match what you determined right at the beginning.