I’m a big fan of vicarious learning. Sharing experiences and mistakes makes us all smarter and able to do more. Here, I share a rather brutal learning experience I recently encountered with the hope that you may not go through the same pain with your own research.
As my colleague Jerry Overton recently shared, CSC and Microsoft have teamed up to create an Industrial Machine Learning utility, which helps enterprises turn their data into data stories – a key component of a successful machine learning project.
In order to showcase what we can do for clients, we are developing a series of data stories that are applicable to six different industries: banking and capital markets, energy and technology, insurance, manufacturing, healthcare, and retail.
To showcase the selected data stories, we create simulated data. Simulations are useful for developing models when data has not yet been obtained. And they can fill out existing data with additional scenarios.
However, in order to be insightful, rather than just interesting, simulations and models need to be grounded in reality.
Recently, I developed a simulation that needed to generate 1,000,000 realistic “reservation periods” for rental cars, among 30 other data points. My mental use case was that the simulated data should look like someone had taken a cut of all reservations that had started and finished in the last year. This constraint proved to be challenging.
I started with two random uniform distributions of 1,000,000 records, rescaled them so that they fit between how R calculates today and a year ago, and then transformed them to actually look like dates. But this created a problem: About half of my end dates were before my start dates.
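The failure is easy to reproduce. The following is a minimal Python sketch, not the original R code, and it uses 100,000 records instead of 1,000,000 to keep it quick; the helper `random_date` and the variable names are assumptions made for illustration. Drawing start and end dates independently from the same window leaves roughly half of the pairs out of order.

```python
import random
from datetime import date, timedelta

random.seed(42)  # fixed seed so the sketch is repeatable

N = 100_000  # assumption: smaller than the article's 1,000,000 for speed
today = date.today()
year_ago = today - timedelta(days=365)


def random_date(lo, hi):
    """Draw a date uniformly between lo and hi (both inclusive)."""
    return lo + timedelta(days=random.randint(0, (hi - lo).days))


# Two independent uniform draws over the same one-year window
starts = [random_date(year_ago, today) for _ in range(N)]
ends = [random_date(year_ago, today) for _ in range(N)]

# Count the pairs where the reservation "ends" before it starts
invalid = sum(end < start for start, end in zip(starts, ends))
share_invalid = invalid / N
print(f"{share_invalid:.1%} of end dates fall before their start date")
```

Because the two draws know nothing about each other, either date is equally likely to come first, which is why the share of invalid pairs sits near one half.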
What followed were nine hours (I’m not exaggerating) of trying and failing to get the reservations to behave realistically. First, I thought I could simply set all the invalid end dates to “today”. But that wouldn’t be realistic: out of 1,000,000 transactions in a year’s time, 500,000 don’t all end on the same day.
Eventually, I thought, “Ah, I can flip the difference!” I thought it important to preserve whatever random reservation length had been created, but then I introduced future end dates.
Whenever I got the dates conforming to one constraint, I blew another one out of the water. Just imagine several more hours of me weeping and gnashing my teeth – and uttering words I won’t reprint.
I finally came up with an idea that “worked” in that it satisfied the two critical constraints out of my three: end date after start date, and end date not after today. (We’ll talk about reasonable reservation lengths later.)
I would generate an end date by:
1. Calculating the number of days between the start date and today
2. Assigning that number to a variable called date_diff
3. Generating a random number between 0 and date_diff
4. Adding that random number as days to the start date
Then I would apply this code line by line to 1,000,000 records….
This code kinda worked? It got the job done, though I had to clean up a handful of NA’s that were introduced. It was really slooooowwwww and created stupidly long reservations. However, I was exhausted at this point and didn’t care. Another team needed my data and it was midnight. I didn’t like my code. It was slow and inelegant, but it got the job done.
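For illustration, the four steps can be sketched in Python as follows. This is a reconstruction from the description, not the original R code; the function name `random_end_date` and the 10,000-record sample size are assumptions. It also shows where the overly long reservations come from: a start date early in the year can be paired with an end date many months later.

```python
import random
from datetime import date, timedelta


def random_end_date(start, today):
    """Pick an end date at random between start and today (inclusive)."""
    date_diff = (today - start).days        # steps 1 and 2: days from start to today
    offset = random.randint(0, date_diff)   # step 3: random number in [0, date_diff]
    return start + timedelta(days=offset)   # step 4: add it to the start date


random.seed(0)  # fixed seed so the sketch is repeatable
today = date.today()
year_ago = today - timedelta(days=365)

# Assumption: 10,000 uniformly distributed start dates, processed one by one
starts = [year_ago + timedelta(days=random.randint(0, 365)) for _ in range(10_000)]
ends = [random_end_date(start, today) for start in starts]

lengths = [(end - start).days for start, end in zip(starts, ends)]
mean_length = sum(lengths) / len(lengths)
print(f"longest: {max(lengths)} days, mean: {mean_length:.0f} days")
```

Every end date lands between its start date and today, so the first two constraints hold, but nothing limits the reservation length, and the average comes out at months rather than days.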
The next day I attended a scrum meeting with the rest of the team that would be using my simulated data to generate the data story. I was pulled in late to the project, so this was the first time I had an opportunity to speak to them on the phone. While explaining the data and its limitations, I had an epiphany. Miraculously, I resisted the urge to squeal in their ears and instead told them that I would have an updated data set to them later that day.
I had gone about my solution all wrong, I realized. I didn’t need to generate the end date to calculate the reservation length. I needed to generate the reservation length to calculate the end date.
To simulate random reservation lengths:
- Generate a normal distribution of reservation lengths scaled to whatever is the mean reservation length for your question.
- Based on the mean reservation length, calculate the max reservation length (assuming a minimum of 1, this would be roughly double the mean).
- Subtract the max reservation length from today to get the upper limit of the start dates.
- Generate a random uniform distribution for start dates
- Add the reservation lengths in days to your start dates to calculate the end dates.
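The steps above can be sketched as follows, again in Python rather than the original R. The mean of 5 days, the standard deviation, the rounding to whole days, and the clamping to the minimum and maximum are all assumptions chosen for illustration; the article does not state the values it used.

```python
import random
from datetime import date, timedelta

random.seed(7)  # fixed seed so the sketch is repeatable

N = 100_000                           # assumption: smaller than 1,000,000 for speed
MEAN_DAYS = 5                         # assumption: mean reservation length
MIN_DAYS = 1
MAX_DAYS = 2 * MEAN_DAYS - MIN_DAYS   # symmetric around the mean, roughly double it
SD_DAYS = (MAX_DAYS - MEAN_DAYS) / 3  # assumption: keeps most draws inside the range


def reservation_length():
    """Draw a whole number of days from a normal distribution, clamped to the range."""
    days = round(random.gauss(MEAN_DAYS, SD_DAYS))
    return min(max(days, MIN_DAYS), MAX_DAYS)


today = date.today()
year_ago = today - timedelta(days=365)
latest_start = today - timedelta(days=MAX_DAYS)  # upper limit for start dates
window = (latest_start - year_ago).days

reservations = []
for _ in range(N):
    # Uniform start date, then end date = start date + reservation length
    start = year_ago + timedelta(days=random.randint(0, window))
    end = start + timedelta(days=reservation_length())
    reservations.append((start, end))

mean_length = sum((end - start).days for start, end in reservations) / N
print(f"mean reservation length: {mean_length:.2f} days")
```

Since every length is at least one day and no start date is later than today minus the maximum length, each end date falls after its start date and never after today, with no per-row repair needed. Plain loops keep the sketch dependency-free; at a million rows a vectorized approach would be the more natural choice.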
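Yes, at the end of the year I lose some short reservations: because start dates stop one maximum reservation length before today, short reservations that would have started and finished in those final days never appear. But out of a million reservations that I was not adjusting for seasonality, this was inconsequential.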
I finished cleaning up the code, reran everything, and about 3 minutes later I had 1,000,000 transactions with realistic reservation dates spread out over 500,000 customers.
As I mentioned earlier, these reservation simulations were just part of a larger simulation, which in turn was part of a demo of CSC’s Industrial Machine Learning Utility offering. When I was done, I handed the data off to machine learning experts, who in turn handed it off to a visualization expert.
Our end result was this dashboard, built in Microsoft Power BI:
Logan Wilt is a Data Scientist within CSC’s ResearchNetwork, and most recently was the Taxonomy Architect for CSC’s Integrated Workforce Management Center of Excellence. She is interested in practical tactical data science applications as well as leveraging semantic technologies for better data discovery.