It’s been a few weeks since my last blog post; I am back in Chile, and there is a lot to discuss. But I’ll save you from the majority of it.

In September, I spent two weeks visiting Yale, which is a great place for astronomy. The department there is a good size, and there are always events to attend - talks, journal clubs, student/post-doc workshops, etc. This is pretty different from my previous experience at universities with smaller departments. There is just a better sense of community once a department grows beyond some critical size.

The department also has a very positive attitude toward academic culture. For me, the attitude of the community makes a world of difference. When I was a first-year graduate student I was surrounded by a lot of negativity, and that continued into my second and third years. It has taken me a very long time to work my way out of that. Having a more positive attitude about academia and astronomy gives me more motivation to keep working on things. It's a feeling I get when I attend astronomy meetings and conferences. And I'm pleased to report that I'm working on revising a paper that, hopefully, will be accepted for publication soon.

Also in the past month, an undergraduate from UdeC has been emailing me asking to work with me. I was hesitant at first because I haven't been sure where I will be in the near future. I decided to give him some reading, and now I am actually excited about giving him a project. We met in person last week, and he said he wanted to work on some data. I have a mental sketch of a decent undergraduate project in which he will get some experience and learn something useful. Taking him on as a student means more work and responsibility for me, but that is a good thing, because it means I continue to progress in my career.

I also had a conversation with my current faculty sponsor. We decided that it might make sense for me to apply for some faculty positions in order to help solve the Two-Body Problem. This is going to be very interesting; I've never applied for a faculty position before. After thinking about it for a couple of days, I've convinced myself that I would make a good candidate, and I hope that my application(s) will reflect that.

For a while I have been pretty conflicted about how and where I might continue in my professional career. But since attending the data science workshop in August (a very positive experience) and since visiting Yale for another two weeks (another positive experience), I am a little less worried about where I may end up. I feel a renewed attitude, and I expect that whatever job/career opportunities arise will be rewarding.

## Sunday, September 28, 2014

## Monday, September 1, 2014

### Data Science is Not Magic

We’re at the end of August and close to the end of the S2DS program that I’ve been attending in London. The program has been great, and I have learned much more in a month than I would have simply by investigating on my own. I have had a good team, and I feel like we accomplished something useful.

As I mentioned in previous posts, my team is focused on unifying offline shopping data (specifically from grocery stores) with online data from Facebook. I won't tell you the actual results we found - for that you'll have to sit in on our presentation on Tuesday afternoon. But I will describe some of the methods we used.

We investigated whether users who were similar to one another in one data set could be recovered by looking at their habits in the other data set. This is referred to as a Look-Alike Engine. We went about it by creating similarity matrices that assign a score to each user-pair based on how similar they are to one another using the input data. For shopping habits that means comparing users who bought similar items. For the social network data that means comparing users who have liked similar content.

There are a couple of different metrics one can use to determine the similarity between two sets. A simple one is the Jaccard index, defined as the size of the intersection of two sets divided by the size of their union. We decided that this scoring metric was not producing the results we wanted: computed naively over binary vectors, two users who each have missing data can be scored as similar, because a '0' in one user's data is counted as matching a '0' in the other's. Whether this happens depends somewhat on how the data sets are constructed and how the code runs, but it is not behavior we wanted in the end.
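To make the definition concrete, here is a minimal sketch of the Jaccard index on sets of purchased items. The user names and grocery items are invented for illustration; the actual data and code from the project looked different.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:  # guard against two empty sets
        return 0.0
    return len(a & b) / len(union)

# Hypothetical users described by the items they bought.
alice = {"milk", "bread", "eggs"}
bob = {"milk", "bread", "cheese"}
print(jaccard(alice, bob))  # 2 items in common, 4 in the union -> 0.5
```

Note that on true sets the index ignores items neither user bought; the shared-zero problem only appears when the score is computed position-by-position over binary vectors.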

Another metric is the cosine similarity score. This requires vectorizing the data in a consistent way, because cosine similarity is defined as the dot product of the two users' vectors divided by the product of their magnitudes. Recall that the dot product can be written A·B = |A||B|cos(θ), so this method simply calculates cos(θ). At first I found this concept slightly difficult to apply to "shopping habits" and "likes" from Facebook. However, it is fairly straightforward once you organize the data into vectors.

For example, with the movie 'likes' from Facebook, we created a vector for each user in which every element corresponds to a possible movie title, populated with a 1 if the user 'liked' that title and a 0 otherwise. Then every user had a vector spanning all possible dimensions (movie titles), and the cosine metric is simply the normalized dot product of two such vectors. For our data this meant that if two users did not declare a 'like' for a given movie, that movie did not make them more similar - we don't actually know whether they liked the movie or not. Not declaring a preference is not the same as disliking something.
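The vectorization and the cosine score can be sketched like this. The movie catalogue and the two users' likes are hypothetical stand-ins for the real data:

```python
import math

# Hypothetical catalogue of movie titles; each title is one vector dimension.
catalogue = ["Alien", "Amelie", "Brazil", "Casablanca"]

def to_vector(likes):
    """1 if the user 'liked' the title, 0 otherwise (a 0 is not a dislike)."""
    return [1 if title in likes else 0 for title in catalogue]

def cosine(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

u = to_vector({"Alien", "Brazil"})
v = to_vector({"Alien", "Casablanca"})
print(cosine(u, v))  # 1 shared like, 2 likes each -> 1 / (√2·√2) = 0.5
```

Because shared zeros contribute nothing to the dot product, two users who simply never declared a preference are not pushed together - exactly the behavior the Jaccard-on-vectors approach lacked.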

Once we measured the similarity between users with a given metric, we organized the data into a large similarity matrix. This matrix simply contained all the similarity scores for every user pair, and allowed us to find the users who are most similar to one another.
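The matrix step is just a score for every user pair, from which the most similar pairs can be read off. A toy sketch with invented users and the Jaccard score standing in for whichever metric is chosen:

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical users and their liked titles.
users = {
    "alice": {"Alien", "Brazil"},
    "bob": {"Alien", "Casablanca"},
    "carol": {"Brazil"},
}

names = sorted(users)
# Similarity "matrix" as a dict keyed by user pair.
matrix = {(i, j): jaccard(users[i], users[j]) for i in names for j in names}

# Most similar distinct pair.
best = max(((i, j) for i in names for j in names if i < j), key=lambda p: matrix[p])
print(best, matrix[best])  # ('alice', 'carol') share Brazil out of 2 titles -> 0.5
```

For a real data set one would use a sparse matrix rather than a dict of all pairs, since the number of pairs grows quadratically with the number of users.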

Finally, we could test whether or not users who were similar in one data set were also similar in another. We selected the most similar users in one set to use as a control sample. Then in the other set we selected the most similar users and compared the two resulting selections of users to see how many we recalled and to what precision.

Recall and precision are two metrics by which we measured the ability of our method to find similar users. Recall is defined as the number of true positives in the set divided by the total number we could possibly find (the true positives plus false negatives). Precision is defined as the number of true positives divided by the number you’ve actually selected (true positives plus false positives). Depending on your application you might want to sacrifice some precision in order to obtain better recall, or vice versa.
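Those two definitions are easy to state in code. The user IDs and the "truly similar" set below are made up for the example:

```python
def recall_precision(selected, truth):
    """selected: users our method flagged; truth: users that are truly similar."""
    tp = len(selected & truth)  # true positives
    recall = tp / len(truth) if truth else 0.0        # tp / (tp + fn)
    precision = tp / len(selected) if selected else 0.0  # tp / (tp + fp)
    return recall, precision

truth = {"u1", "u2", "u3", "u4"}     # hypothetical truly-similar users
selected = {"u1", "u2", "u5"}        # hypothetical method output
print(recall_precision(selected, truth))  # recalled 2 of 4; 2 of 3 selections correct
```

Selecting more users generally raises recall and lowers precision, which is the trade-off mentioned above.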

Lastly, in order to determine whether our method was actually any good, we conducted a simple Monte Carlo simulation in which we randomly selected users from the experimental data set to compare to the control set. In the end, we were able to determine the recall and precision, as a function of the number of users in the sample selection, both for random selections and for selections made using the similarity scores.
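The random baseline can be sketched in a few lines. The population size, sample size, and "truly similar" set are invented; the point is only that repeated random draws give the recall you would expect by chance:

```python
import random

all_users = [f"u{i}" for i in range(100)]
truth = set(all_users[:10])  # hypothetical truly-similar users

def recall(selected, truth):
    return len(set(selected) & truth) / len(truth)

random.seed(0)  # fixed seed so the run is repeatable
trials = [recall(random.sample(all_users, 10), truth) for _ in range(1000)]
baseline = sum(trials) / len(trials)
print(baseline)  # roughly 10/100 = 0.1 on average
```

Any method whose recall does not clearly beat this baseline is not finding real structure in the data.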

Now if there actually is a correlation between people’s shopping habits and the movies and TV shows that they like on facebook, then we expect to more accurately select users from our similarity matrix than we would just by selecting at random. As I said, I won’t tell you the result, but you can hazard a guess on your own.

Coming up, we will present our results to the other participants and partner companies on Tuesday afternoon. Then on Wednesday afternoon there will be a job fair where we will be able to discuss opportunities with the partner companies.

As I said, I have found this experience very useful, and now I have a definite project that I can point to - with results - that shows what I can do as a data scientist. I am hoping that I will be able to develop more skills in data science and machine learning techniques as I move forward with my career.

Kyle D Hiner
