Saturday, August 16, 2014

S2DS and python, etc. - Pt 2

Week 2 of S2DS is down, and this week we began working full-time on our projects. As I mentioned before, my team has some proprietary data to work with. We’ll be combining offline shopping habits with online social network preferences to try to build models of consumer demographics and behaviors. So far we’ve just been looking at the data to get a feel for what is in there, and we’ve started to brainstorm some ideas about what we can do with it.

Our team only has three members. One is working with the offline data, while I have partnered with the other to work with our Facebook data. We are currently using a python package called GraphLab to do some modeling and analysis of the data. To begin, we wanted to see if we could create a movie recommendation engine based on the movies that users had ‘liked’ on their Facebook profiles.

The GraphLab tool makes implementing machine learning techniques so streamlined and black-boxed that it is almost trivial. Almost. And, actually, as a scientist, that is extraordinarily frustrating. I’m dying to get into the details of the algorithms and theory behind this stuff. That, as you might expect, is non-trivial. At a basic level, you don’t need to know all that stuff, but if you want to understand what is going on and how best to use the tools, you had better know it. As a scientist I can’t stand using a tool if I don’t know how it works. I can’t defend my work if I don’t know what it is that I actually did - that is fundamental to science. The good news is that everyone here understands that.

The bad news is that I still only get a couple of weeks to dive into all of it. It turns out that industry/business works on much shorter timescales than academia. That’s what we are told, at least. For now, I am trying to learn as much as possible. But, as with any new endeavor, you can only learn so much in a short amount of time. I know I won’t be an expert after this program, but I will have a great deal more knowledge. (Aside: One thing I learned from working in the CA Senate - if you know one thing about anything that no one else knows, then you are the expert in the room.)

I’ve been getting more comfortable with programming in python now. After taking the Udacity courses and playing around on my own, I feel like I can understand many of the basic structures that python is using. Even though I don’t know a lot right now, I am more confident in my ability to quickly learn python and its various packages (pandas, numpy, scipy, graphlab, etc.) now that I have more experience with them - and that is going to be really important going forward in my career.

At the end of this week, we (my S2DS team and I) have created a basic recommendation function for movies based on people’s Facebook likes. Almost immediately I’ve learned a good lesson about testing code. As I mentioned above, we are using a python package that works a lot like a black box. So, to figure out how it works, we have to test various inputs and see if the outputs make sense. For example, we’ve tried generating fake users that are copies of our real users to see if we get consistent recommendations for the invented data. At first, we didn’t, which indicated we were doing something wrong. Without that test we would not have noticed the problem, so the value of testing code is very high. Our background in science has given us this attention to systematics and details, so that we understand exactly what we are doing and can produce consistent results.
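
Here is a minimal sketch of that kind of duplicate-user consistency check. It does not use GraphLab or the project's data: the simple item-similarity recommender, the function names (build_item_similarity, recommend), and the toy likes are all made up for illustration. The idea is only that a cloned user, scored by the same model, should get exactly the same recommendations as the original.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_item_similarity(likes):
    """Cosine similarity between movies, from a dict of user -> set of liked titles."""
    item_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for movies in likes.values():
        for movie in movies:
            item_counts[movie] += 1
        # Count how often each pair of movies is liked by the same user.
        for a, b in combinations(sorted(movies), 2):
            pair_counts[(a, b)] += 1
    sim = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        score = n / sqrt(item_counts[a] * item_counts[b])
        sim[a][b] = score
        sim[b][a] = score
    return sim


def recommend(user, likes, sim, k=3):
    """Rank unseen movies by summed similarity to the movies the user already likes."""
    seen = likes[user]
    scores = defaultdict(float)
    for movie in sorted(seen):  # sorted so the summation order is deterministic
        for other, score in sim.get(movie, {}).items():
            if other not in seen:
                scores[other] += score
    # Break ties by title so identical inputs always give identical rankings.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:k]]


# Toy data (hypothetical users and likes).
likes = {
    "alice": {"Alien", "Blade Runner", "Brazil"},
    "bob": {"Alien", "Blade Runner", "Heat"},
    "carol": {"Heat", "Ronin"},
}

# Add a fake user that is an exact copy of a real one, then fit one model.
likes_with_clone = dict(likes)
likes_with_clone["alice_clone"] = set(likes["alice"])
sim = build_item_similarity(likes_with_clone)

recs_real = recommend("alice", likes_with_clone, sim)
recs_clone = recommend("alice_clone", likes_with_clone, sim)

# The consistency check: the copy must get the same list as the original.
assert recs_real == recs_clone, (recs_real, recs_clone)
print(recs_real)
```

Note that adding the clone changes the fitted model slightly (the cloned likes are counted twice), so the comparison is between the two users under the same model, not between models fitted with and without the clone.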

We are planning to expand on our work by increasing complexity. We would like to incorporate other variables like music and book preferences as well. Hopefully we’ll be able to take in a given set of music preferences and recommend both additional music and movies based on other users’ preferences. Collectively this is sometimes referred to as a look-alike algorithm.
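
As a rough illustration of the look-alike idea, the sketch below takes a set of music preferences, finds the users whose music tastes overlap most (using Jaccard similarity, which is my choice for the example and not necessarily what GraphLab or the team would use), and recommends the music and movies those users like. The profiles, the names (jaccard, lookalike_recommend), and the n_neighbors parameter are all hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lookalike_recommend(music_prefs, profiles, n_neighbors=2):
    """Recommend music and movies liked by the users most similar in music taste."""
    weights = {u: jaccard(music_prefs, p["music"]) for u, p in profiles.items()}
    # Keep only users with some overlap, most similar first (ties broken by name).
    neighbors = sorted(
        (u for u in weights if weights[u] > 0),
        key=lambda u: (-weights[u], u),
    )[:n_neighbors]

    scores = {"music": defaultdict(float), "movies": defaultdict(float)}
    for user in neighbors:
        for kind in scores:
            for item in sorted(profiles[user][kind]):
                if item not in music_prefs:  # don't recommend what was given as input
                    scores[kind][item] += weights[user]

    return {
        kind: [item for item, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))]
        for kind, s in scores.items()
    }


# Toy profiles (hypothetical users and preferences).
profiles = {
    "u1": {"music": {"Radiohead", "Portishead"}, "movies": {"Brazil"}},
    "u2": {"music": {"Radiohead", "Bjork"}, "movies": {"Alien", "Brazil"}},
    "u3": {"music": {"Metallica"}, "movies": {"Heat"}},
}

result = lookalike_recommend({"Radiohead"}, profiles)
print(result["music"])
print(result["movies"])
```

Here a movie is ranked higher when more of the look-alike users like it, weighted by how closely each of those users matches the input preferences.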

A lot of the work that I’ve just described is exploratory. Next week we are going to have to focus more on combining our data sets and doing some real hypothesis testing. In the end, we are hoping to develop one or more algorithms that will actually be useful. I think I will be a very strong candidate for data science jobs by the end of the program.
 
Kyle D Hiner
Studious data scientist
