Advanced Mixture Modeling
Elements: Parallel and high performance computing, GPU programming, expectation maximization, NVidia CUDA, C, Mathematica, interdisciplinary collaboration
Links: Demo notebook (PDF)      Mathematica code (ZIP)
Summary

This project arose from a request to assist with a data set possessing a long-tail distribution that was not being fitted adequately by the models being used. I found that the method (least squares fits to cumulative distributions) was outdated, limited to 1 dimensional distributions, and would tend to fit the dense portion of the distribution, ignoring the long tail.

I chose a more modern approach, using expectation maximization to escape all the above limitations. Another improvement was the use of extensive restarting to ensure that all local optima had been considered. (Generally speaking, nonlinear optimization methods only find local optima near starting points, requiring restarting at many randomly chosen starting points to cover all possible local optima.) Extensive restarting required enormous computing resources, and to accelerate it I targeted NVIDIA graphics cards (graphics processing units, or GPU’s).

The software I wrote fits mixtures of joint distributions. The joint distributions are user-defined to utilize a selected distribution on each axis of the data. The software iterates the number of mixture components, and for each number of components it randomly restarts the fitting process many times, and then clusters the resulting models. The models and plots are exported to Excel files for further analysis.

I succeeded in fitting the long tail distribution with 3 mixture components. Four mixture components slightly improved the fit. When fitting 5 or more components, components were most often duplicated, indicating that only 4 components had actually been found. This was the first time that more than 2 components had been identified in data of the type we were analyzing. The next thing to do will be to generate experimental data where the physical number of components is known, and determine whether expectation maximization can recover the number of components.