The art of the hack

I teach a course called “Mastering Data Science at Enterprise Scale: How to design and implement machine-learning solutions that improve your organization.” 

In the course, students learn, among other things, how to write algorithms like a pro.

To operate on the level of a professional data scientist, I tell them, you have to master the art of the hack — getting good at producing new, minimum-viable data products based on adaptations of assets you already have.

Recently, I held a CSC-internal practice run of my class. I took students through an exercise to write their own machine-learning algorithm. Logan Wilt, a colleague of mine and a new data scientist at CSC, took up the challenge with amazing skill. I asked her to explain how she did it, in her own words, below.

— Jerry Overton

 

My Thoughts on the Art of the Hack: A Case Study
By Logan Wilt

Data science as a field is often thought of as the intersection of computer science and statistics and, as such, is populated by people with backgrounds in math, software engineering or computer science. That is not my background.

I share this because there are actually numerous roads into data science, but many people without a formal background in computing are intimidated by programming or writing algorithms. My bachelor’s degree is in Business Administration: Management, and my previous work was in accounting, hospitality and human resources.

Movies, books, and TV shows depict programming as brilliant nerds furiously typing onto a blank screen and, 30 seconds later, hacking into the alien mothership. That’s fiction for any kind of programming, and it’s certainly not how programming works in data science. However, for people without a computer background, that’s our main reference point, even if it is false. The reality is that data science programming is incremental and evolutionary, because science itself is incremental and evolutionary.

If I have seen further than others, it is by standing upon the shoulders of giants.
- Isaac Newton

That is The Art of the Hack.

Jerry’s Code & My Twist

First a note on my experience with R: I still very much consider myself a beginner. I’ve completed about six hours of tutorials, participated in some online classes and have maybe 50 hours of practice spread out over the last two years.

My assignment for Jerry’s class started with this piece of code that pulls in about 60 days of data on the S&P 500.

[Screenshot: Jerry’s original code for pulling about 60 days of S&P 500 data]
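Jerry’s exact snippet isn’t reproduced here, but as a rough stand-in, pulling roughly 60 days of S&P 500 history into R from a Yahoo Finance CSV export might look like this (the file name is hypothetical, and the original code may have fetched the data differently):

    # Read a CSV export of recent S&P 500 daily prices from Yahoo Finance
    # (hypothetical file name; Jerry's original code may have pulled the data another way)
    sp <- read.csv("sp500_history.csv")
    head(sp)   # typical columns: Date, Open, High, Low, Close, Volume, Adj.Close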

We were instructed to modify and make our own forecasts with the data.

I wanted to see if I could forecast the S&P 500 Close values based on another set of data. This led me to make a hypothesis: The S&P 500 is affected by the overall social mood in the United States.

Admittedly, that’s a really simple hypothesis on the S&P. Really what I wanted to do was play around with two sets of data and see if I could find any evidence of a correlation.

Grabbing Data

I went out to Data.gov and started clicking around until I found some data on Consumer Complaints that I thought could be used as a measure of social mood, i.e., people complain more when they are unhappy and less when they are happy.

Using some of Jerry’s code and my own, I pulled both the S&P and Consumer Complaints data into my working environment.
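As a sketch, reading the Consumer Complaints extract might look something like this (the file name is hypothetical):

    # Read the Consumer Complaints extract downloaded from Data.gov
    # (hypothetical file name)
    complaints <- read.csv("Consumer_Complaints.csv")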

Cleaning Up the Data

I ran str() on both sets of data to inspect their structures, which showed me that in both data sets the date field had been read in as a Factor rather than as a Date, and that the dates were in different formats. I also had a lot more data in the Consumer Complaints file than in the S&P 500 file, so I wanted to trim down the Consumer Complaints data.
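A rough sketch of those steps in R, carrying over the names from the earlier sketches (the complaints column name Date.received and the two date formats are assumptions):

    # Inspect the structures; the date columns come in as Factors
    str(sp)
    str(complaints)

    # Convert Factor to Date, using each file's own date format (assumed here)
    sp$Date <- as.Date(as.character(sp$Date), format = "%Y-%m-%d")
    complaints$Date.received <- as.Date(as.character(complaints$Date.received),
                                        format = "%m/%d/%Y")

    # Trim the Consumer Complaints data to the date range covered by the S&P file
    complaints <- subset(complaints,
                         Date.received >= min(sp$Date) &
                         Date.received <= max(sp$Date))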

I Googled all of the above code. Variations of “convert date” and “subset by date” were my search terms, and I clicked around on the first few hits until I found someone else’s answer to a similar, already-asked question that made sense to me.

Matching, Aggregating and More Subsetting

Now that I had my dates looking good, I wanted the column names of the two data sets to match, and I needed the total number of complaints per date. That total wasn’t directly in my data, since the Consumer Complaints file lists each individual complaint filed on a particular date.
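A sketch of what that might look like, reusing the names from the earlier sketches (not the exact code):

    # Rename the complaints date column so it matches the S&P data
    names(complaints)[names(complaints) == "Date.received"] <- "Date"

    # Count complaints per day with dplyr
    library(dplyr)
    complaints_by_day <- complaints %>%
      group_by(Date) %>%
      summarise(Complaints = n())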

Again, I Googled to get the above code. I didn’t quite remember how to rename a column, and I had never used the dplyr library for aggregating data before. As before, I clicked around on the first few sites looking for examples that made sense. The usual sites that come up are Stack Overflow, R-Bloggers and the CRAN documentation pages…though I rarely use the CRAN pages. They are useful when I need a refresher on a function, but when I’m seeing one for the first time, I need examples.

Moving back to the S&P 500 data, I did a quick subset of just the columns I wanted. I’m pretty sure I’ve seen something like the select() function in my tutorials, but I actually stumbled upon this method while searching for code for some of the other steps above. I have a decent SQL background, so this method seemed clean and elegant to me. It will become my go-to way of selecting columns from data sets.
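One common way to do that kind of column subset in base R (not necessarily the exact method described above):

    # Keep only the Date and Close columns from the S&P data
    sp_close <- subset(sp, select = c(Date, Close))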

Choosing the Model

At this point I was feeling ready to rock ’n’ roll—my data sets were in, munged, and aggregated. Since I wanted to see if there was any correlation between the number of complaints on a given date and where the S&P 500 closed, I chose simple linear regression as my evaluation model.

My first step was to get the relevant columns from my two data sets into one data frame linked on the date. I used merge() to do this, but had to look up the syntax. Then I converted the date field back to the Date class. While earlier I had converted the Date columns from Factor to Date to correct the formats and subset the data, I had then chosen to return them to Factor, because Jupyter, the practice environment I was using, doesn’t seem to display dates correctly. If I had been working in RStudio, I would have left everything as Dates from the beginning. Then I used complete.cases() to remove any dates with values missing from either the Complaints or Close columns. I had seen this before, but I had to remind myself how to use it.
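A sketch of those steps, carrying over the names from the earlier sketches:

    # Join the two data sets on Date
    combined <- merge(sp_close, complaints_by_day, by = "Date")

    # Make sure the merged date column is of class Date again
    combined$Date <- as.Date(combined$Date)

    # Keep only rows where both Complaints and Close are present
    combined <- combined[complete.cases(combined), ]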

Plotting the Relationship

[Plot: lattice xyplot of S&P 500 Close against daily Consumer Complaints]

I chose to plot the data to see if there was any correlation between my data sets. This took a fair amount of Googling to find a function that I liked and that made sense to me. The thing with R is that there are hundreds, if not thousands, of packages, and many do the same things. I rejected a few other methods because they seemed too complicated to me, although the lattice package may have been overkill for this task. That being said, I learned that it can do some pretty cool graphs, and I’m sure I’ll be tapping into this package again. Unfortunately for my hypothesis, my interpretation of the xyplot is, “Not really any correlation here.”
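The plotting call itself is short with lattice; something along these lines, with column names assumed from the earlier sketches:

    # Scatterplot of S&P 500 Close against daily complaint counts
    library(lattice)
    xyplot(Close ~ Complaints, data = combined,
           xlab = "Consumer Complaints per day", ylab = "S&P 500 Close")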

Had I been working on an actual task and not a practice exercise, I would have stopped here: either my assumptions are wrong, my data is wrong, I need to transform the data further, or something else is off. My plot looks too scattershot for me to continue as is. However, this was a practice exercise, and I was playing around anyway.

Simple Linear Regression in R

I ran R’s linear regression function, lm(), and mostly came away with the feeling that there isn’t a significant relationship between the Consumer Complaints and the S&P Close — and that I need a refresher in statistics.
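For reference, a simple linear regression of Close on Complaints looks like this, with names carried over from the sketches above:

    # Fit and summarize a simple linear regression of Close on Complaints
    fit <- lm(Close ~ Complaints, data = combined)
    summary(fit)   # check the Complaints coefficient, its p-value and the R-squared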

Replot, Separately

At this point I felt I had missed a step. I didn’t plot my data sets individually first; I jumped right into the xyplot. So I decided to graph each on its own.

This took me longer than anything.

[Plot: S&P 500 Close over time, with a linear trend line]

At first I wanted to plot them on the same axes, for comparison. I found several ways to do similar things, but none fit quite right. Then I realized that nothing fit quite right because, while my x-axes were the same, my y-axes were too different. The functions I found were fine; my logic was not. I finally settled on plotting the data sets separately with ggplot(), from the widely used ggplot2 package.
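A sketch of plotting each series separately with ggplot2, including the linear trend lines:

    library(ggplot2)

    # S&P 500 Close over time, with a linear trend line
    ggplot(combined, aes(x = Date, y = Close)) +
      geom_point() +
      geom_smooth(method = "lm") +
      ggtitle("S&P 500 Close")

    # Daily complaint counts over time, with a linear trend line
    ggplot(combined, aes(x = Date, y = Complaints)) +
      geom_point() +
      geom_smooth(method = "lm") +
      ggtitle("Consumer Complaints per day")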

Incremental Changes are Key

After figuring out how to add the trend lines to the plots, I stopped fiddling. I don’t know exactly how much time I put into the project — a couple of hours the first afternoon, maybe two more that night, and then a little bit more the next morning — so maybe four to five hours in total.

The Art of the Hack is about making incremental changes that modify existing code to suit your needs. Had I attempted to sit and write a linear regression algorithm based on data from Data.gov and Yahoo Finance all on my own, I would have failed.

However, with a moderate amount of learning invested in R, a few hours to fiddle and the Art of the Hack, I didn’t fail. I wrote a simple linear regression algorithm.


Jerry Overton — Distinguished Engineer

Jerry Overton is head of Advanced Analytics Research in CSC’s ResearchNetwork and the founder of CSC’s FutureTense initiative, which includes the Predictive Modeling Research Group, the Advanced Analytics Lab and the Predictive Modeling School.

See Jerry’s bio.

Logan Wilt — Skills Taxonomy Architect

Logan Wilt discovered Data Science four years ago and fell in love, because it allows her to ask more questions and find answers faster. She has been with CSC for almost 6 years and currently supports the Integrated Workforce Management Center of Excellence as the Skills Taxonomy Architect.

See Logan’s bio.
