Saturday, August 16, 2014

S2DS and python, etc. - Pt 2

Week 2 of S2DS is down, and this week we began working full-time on our projects. As I mentioned before, my team has some proprietary data to work with. We’ll be combining offline shopping habits with online social network preferences to try to build models of consumer demographics and behavior. So far we’ve just been exploring the data to get a feel for what is in there, and we’ve started to brainstorm some ideas about what we can do with it.

Our team only has three members. One is working with the offline data, while I have partnered with the other to work with our Facebook data. We are currently using a python package called GraphLab to do some modeling and analysis of the data. To begin, we wanted to see if we could create a movie recommendation engine based on the movies that users had ‘liked’ on their Facebook profiles.
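To give a flavor of the idea (this is not GraphLab’s actual algorithm, just a minimal sketch with invented data), a like-based recommender can score each candidate movie by how much its audience overlaps with the audiences of the movies a user has already liked:

```python
# Toy like-based movie recommender: score candidate movies by Jaccard
# similarity between their fan sets and the fan sets of the user's likes.
# All users and movies here are invented for illustration.

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def recommend(likes, user, top_n=3):
    """Rank movies the user hasn't liked yet.

    likes: dict mapping user -> set of liked movies.
    """
    # Invert to movie -> set of users who liked it.
    fans = {}
    for u, movies in likes.items():
        for m in movies:
            fans.setdefault(m, set()).add(u)
    seen = likes[user]
    scores = {}
    for m in fans:
        if m in seen:
            continue
        # Sum similarity to each movie the user already liked.
        scores[m] = sum(jaccard(fans[m], fans[s]) for s in seen)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

likes = {
    "alice": {"Alien", "Blade Runner"},
    "bob": {"Alien", "Blade Runner", "Solaris"},
    "carol": {"Amelie", "Solaris"},
}
print(recommend(likes, "alice"))  # → ['Solaris', 'Amelie']
```

Because bob likes everything alice does plus Solaris, Solaris comes out on top; the real package does something far more sophisticated, but the input/output shape is much the same.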

The GraphLab tool makes implementing machine learning techniques so streamlined and black-boxed that it is almost trivial. Almost. And, actually, as a scientist, that is extraordinarily frustrating. I’m dying to get into the details about the algorithms and theory behind this stuff. That, as you might expect, is non-trivial. At a basic level, you don’t need to know all that stuff, but if you want to understand what is going on and how best to use the tools, you had better know it. As a scientist I can’t stand using a tool if I don’t know how it works. I can’t defend my work if I don’t know what it is that I actually did - that is fundamental to science. The good news is that everyone here understands that.

The bad news is that I still only get a couple weeks to dive into all of it. It turns out that industry/business works on much shorter timescales than academia. That’s what we are told, at least. For now, I am trying to learn as much as possible. But, as with any new endeavor, you can only learn so much in a short amount of time. I know I won’t be an expert after this program, but I will have a great deal more knowledge. (Aside: One thing I learned from working in the CA Senate - if you know one thing about anything that no one else knows, then you are the expert in the room.)

I’ve been getting more comfortable with programming in python. After taking the Udacity courses and playing around on my own, I feel like I can understand many of the basic structures that python uses. Even though I don’t know a lot right now, I am more confident in my ability to quickly learn python and its various packages (pandas, numpy, scipy, graphlab, etc.) now that I have more experience with them - and that is going to be really important going forward in my career.

By the end of this week, we (my S2DS team and I) had created a basic recommendation function for movies based on people’s Facebook likes. Almost immediately I learned a good lesson about testing code. As I mentioned above, we are using a python package that works a lot like a black box. So, to figure out how it works, we have to test various inputs and see if the output makes sense. For example, we tried generating fake users that are copies of our real users to see if we get consistent recommendations for the invented data. At first, we didn’t, which indicated we were doing something wrong. So, the value of testing code is very high. Our background in science has given us this attention to systematics and details, so that we understand exactly what we are doing and can produce consistent results.
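The duplicate-user check can be sketched like this (a toy stand-in for the actual pipeline, with invented data): copy an existing user under a new id and confirm the recommender returns the same ranking for both.

```python
# Sanity check for a black-box recommender: a fake user who is an exact
# copy of a real user should receive identical recommendations.
# The recommender below is a toy stand-in, not the package we actually used.

def recommend(likes, user):
    """Recommend movies liked by users who share at least one like."""
    mine = likes[user]
    scores = {}
    for other, movies in likes.items():
        if other == user:
            continue
        overlap = len(mine & movies)
        if overlap == 0:
            continue
        for m in movies - mine:
            scores[m] = scores.get(m, 0) + overlap
    # Break score ties alphabetically so the output is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))

likes = {
    "real_user": {"Alien", "Solaris"},
    "neighbour": {"Alien", "Moon", "Sunshine"},
}
likes["fake_user"] = set(likes["real_user"])  # exact copy of real_user

real = recommend(likes, "real_user")
fake = recommend(likes, "fake_user")
assert real == fake, "copy of a user should get the same recommendations"
print(real)  # → ['Moon', 'Sunshine']
```

If the assertion fires, something in the pipeline (non-deterministic ordering, leaked state, a data-preparation bug) is making results irreproducible, which is exactly what we caught.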

We are planning to expand on our work by increasing complexity. We would like to incorporate other variables like music and book preferences as well. Hopefully we’ll be able to take in a given set of music preferences and recommend both additional music and movies based on other users’ preferences. Collectively this is sometimes referred to as a look-alike algorithm.
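One way to picture the look-alike idea (again, invented data and a deliberately simplified sketch, not our production approach): find the existing users whose music tastes most resemble a new profile, then recommend the movies those look-alikes enjoyed.

```python
# Look-alike sketch: match a new music profile against known users,
# then borrow movie recommendations from the closest match.
# All names and tastes here are invented for illustration.

def overlap(a, b):
    """Count of shared likes between two sets."""
    return len(a & b)

def lookalike_movies(users, music_likes, top_k=1):
    """users: dict user -> {"music": set, "movies": set}.

    Returns movies liked by the top_k users with the most music overlap.
    """
    # Rank known users by shared music likes with the new profile.
    ranked = sorted(users,
                    key=lambda u: overlap(users[u]["music"], music_likes),
                    reverse=True)
    movies = set()
    for u in ranked[:top_k]:
        movies |= users[u]["movies"]
    return movies

users = {
    "dan": {"music": {"Radiohead", "Portishead"}, "movies": {"Brazil"}},
    "eve": {"music": {"ABBA"}, "movies": {"Mamma Mia"}},
}
print(lookalike_movies(users, {"Radiohead", "Bjork"}))  # → {'Brazil'}
```

The appeal is that one preference domain (music) can drive recommendations in another (movies), which is what we hope to get out of combining the data sets.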

A lot of the work that I’ve just described is exploratory. Next week we are going to focus more on combining our data sets and doing some real hypothesis testing. In the end, we are hoping to develop one or more algorithms that will actually be useful. I think I will be a very strong candidate for data science jobs by the end of the program.
Kyle D Hiner
Studious data scientist

Sunday, August 10, 2014

Science to Data Science, Pt 1

I’ve traveled to London for a 5-week program on data science techniques. The program is called S2DS (Science to Data Science), and so far it has been really great. There are about 80 participants, who are divided into small (~4 person) teams. Each team partners with a company for the month in order to complete a project that is relevant to the business and serves to teach the participants valuable data science techniques. I actually have some open questions regarding intellectual property, but glossing over that . . .

We’ve just completed the first week, and we’ve covered a lot of material. It is going pretty fast, and I feel like I’ve been here a while already even though I still haven’t met all of the participants. The lectures have been rather superficial. One simply cannot realistically give a substantial introduction to any of these topics in the small time frames that we have. You can’t learn python in 2 hours, and the same goes for Hadoop, SQL, and machine learning. That said, the introductory lectures are valuable. I am still trying to grasp all the concepts, but hopefully I will get a much more detailed experience as the program progresses.

The thing I am most excited about in this program is expanding my skill set. I’ve always been somewhat uncomfortable with my level of competency with various tasks and programming. I’ve previously written about trying to learn more python, including through the Udacity online coursework. This workshop will really be a time that I can dive into some new things that are applicable both in and outside of astronomy.

So far we have had lectures on good coding practices, Hadoop, databases, SQL, NoSQL, statistics, R, natural language processing, and machine learning techniques, plus some more business-oriented seminars on economics and marketing. Meanwhile, we’ve also met with our project mentors to outline the project scope for our group work over the next few weeks. I am most excited to dive deeper into the machine learning techniques, because I feel like those are most useful across a wide variety of potential future jobs.

For my team’s project, we’ll be doing a lot of machine learning and trying to do some predictive analysis. The data set that we are getting access to is proprietary, so I’m not allowed to share information about it. But, generally, we’ll be working with customer data to predict purchasing habits and/or demographics of the customer. It is kind of crazy to think about the amount of information people are willingly (or unwittingly) handing over to various companies. Regardless, I am looking forward to learning the techniques of the project and potentially applying them in academia or other industries.

Kyle D Hiner
Data scientist?