Could someone help me fit the data `collapse_fractions` with a lognormal function whose median and standard deviation are derived via the maximum likelihood method? I tried `scipy.stats.lognorm.fit(data)`, but I did not obtain the results I got with Excel. The Excel file can be downloaded here: https://stacks.stanford.edu/file/druid:sw589ts9300/p_collapse_from_msa.xlsx
Any reference is also very welcome.
```python
import numpy as np

intensity_measure_vector = np.array([[0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1]])

no_analyses = 40
no_collapses = np.array([[0, 0, 0, 4, 6, 13, 12, 16]])

collapse_fractions = np.array(no_collapses / no_analyses)

print(collapse_fractions)
# array([[0.   , 0.   , 0.   , 0.1  , 0.15 , 0.325, 0.3  , 0.4  ]])

collapse_fractions.shape
# (1, 8)

import matplotlib.pyplot as plt

plt.scatter(intensity_measure_vector, collapse_fractions)
```
Answer
I couldn't figure out how to get `lognorm.fit` to work, so I just implemented the functions from your Excel file and used `scipy.optimize` as the optimizer. The added benefit is that it is easier to understand what is actually going on compared to `lognorm.fit`, especially with the Excel file open on the side.
Here is my implementation:
```python
from functools import partial

import numpy as np
from scipy import optimize, stats

# observed data: intensity measures, number of analyses and collapses per level
im = np.array([0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1])
im_log = np.log(im)
number_of_analyses = np.array([40, 40, 40, 40, 40, 40, 40, 40])
number_of_collapses = np.array([0, 0, 0, 4, 6, 13, 12, 16])

FORMAT_STRING = "{:<20}{:<20}{:<20}"
print(FORMAT_STRING.format("sigma", "beta", "log_likelihood_sum"))


def neg_log_likelihood_sum(params, im_l, no_a, no_c):
    sigma, beta = params
    # lognormal fragility curve evaluated at the (log) intensity measures
    theoretical_fragility_function = stats.norm(np.log(sigma), beta).cdf(im_l)
    # binomial likelihood of the observed collapse counts given that curve
    likelihood = stats.binom.pmf(no_c, no_a, theoretical_fragility_function)
    log_likelihood = np.log(likelihood)
    log_likelihood_sum = np.sum(log_likelihood)
    print(FORMAT_STRING.format(sigma, beta, log_likelihood_sum))
    # negated so that a minimizer maximizes the log-likelihood
    return -log_likelihood_sum


neg_log_likelihood_sum_partial = partial(
    neg_log_likelihood_sum, im_l=im_log, no_a=number_of_analyses, no_c=number_of_collapses
)

res = optimize.minimize(neg_log_likelihood_sum_partial, (1, 1), method="Nelder-Mead")
print(res)
```
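Before running the optimizer, you can sanity-check the objective with a single evaluation by hand. This is just a sketch, assuming the block above has already been run; a smaller value means a better fit, because the function returns the negative log-likelihood sum:

```python
# One evaluation of the objective at the start values (1, 1) used by the optimizer.
start_value = neg_log_likelihood_sum((1, 1), im_log, number_of_analyses, number_of_collapses)
print(start_value)
```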
And the final result is:
```
final_simplex: (array([[1.07613697, 0.42927824],
       [1.07621925, 0.42935678],
       [1.07622438, 0.42924577]]), array([10.7977048 , 10.79770573, 10.79770723]))
          fun: 10.797704803509903
      message: 'Optimization terminated successfully.'
         nfev: 68
          nit: 36
       status: 0
      success: True
            x: array([1.07613697, 0.42927824])
```
The interesting part for you is the parameter vector (the first row of `final_simplex`, repeated as `x` at the bottom): it is the same final result as in the Excel calculation (sigma=1.07613697 and beta=0.42927824).
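If you want to work with the fitted values directly instead of reading them off the printout, they can be pulled out of the result object. This is just a sketch using the `x` attribute of scipy's `OptimizeResult`; the parameters come out in the same order they are unpacked inside `neg_log_likelihood_sum`:

```python
# res.x holds the fitted parameters as (sigma, beta)
sigma_fit, beta_fit = res.x
print(f"sigma (median): {sigma_fit:.8f}")
print(f"beta (log-space standard deviation): {beta_fit:.8f}")
```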
If you have any questions about what I did here, don't hesitate to ask, as you said you are new to Python. A few things in advance:
- I minimized the negative log-likelihood sum, as there is no maximizer in scipy.
- `partial` from `functools` returns a function that has the specified arguments already defined (in this case `im_l`, `no_a` and `no_c`, as they don't change); the partial function can then be called with just the missing argument. An alternative without `partial` is sketched after this list.
- The `neg_log_likelihood_sum` function is basically what is happening in the Excel file, so it should be easy to understand when viewing the two side by side.
- `scipy.optimize.minimize` minimizes the function given as the first argument by changing the parameters (start value as the second argument). The method was chosen because it gave good results; I didn't dive deep into the abyss of different optimization methods, gradients etc. So it may well be that there is a better setup, but this one works fine and seems faster than the optimization with `lognorm.fit`.
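As an aside, the fixed arguments can also be handed to the optimizer through its `args` parameter instead of going through `partial`. Just a variant sketch, assuming the data arrays and `neg_log_likelihood_sum` from above are already defined:

```python
# minimize() forwards everything in args to the objective after the parameter vector,
# so functools.partial is not strictly needed here.
res_alt = optimize.minimize(
    neg_log_likelihood_sum,
    (1, 1),
    args=(im_log, number_of_analyses, number_of_collapses),
    method="Nelder-Mead",
)
print(res_alt.x)  # should match res.x from the partial-based call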
The plot like the one in the Excel file is done like this, using the result `res` from the optimization:
```python
import matplotlib.pyplot as plt

x = np.linspace(0, 2.5, 100)
y = stats.norm(np.log(res["x"][0]), res["x"][1]).cdf(np.log(x))

plt.plot(x, y)
plt.scatter(im, number_of_collapses / number_of_analyses)
plt.show()
```
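To tie this back to your original question: the same fitted curve can also be evaluated with `scipy.stats.lognorm` directly, because a lognormal distribution with shape `s=beta` and `scale=sigma` has exactly this CDF. A small sketch, assuming `res`, `x` and `y` from the plotting code above:

```python
# stats.norm(np.log(sigma), beta).cdf(np.log(x)) and stats.lognorm(s=beta, scale=sigma).cdf(x)
# describe the same fragility curve, just in different parametrizations.
sigma_fit, beta_fit = res.x
y_lognorm = stats.lognorm(s=beta_fit, scale=sigma_fit).cdf(x)
print(np.allclose(y, y_lognorm))  # True: the two curves agree point by point
```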