Monday, September 1, 2014

Data Science is Not Magic

We’re at the end of August and close to the end of the S2DS program that I’ve been attending in London. The program has been great, and I have learned much more in a month than I would have simply by investigating on my own. I have had a good team, and I feel like we accomplished something useful.

As I mentioned in previous posts, my team is focused on unifying offline shopping data (specifically from grocery stores) and online data from Facebook. I won’t tell you the actual results that we found - for that you’ll have to sit in on our presentation on Tuesday afternoon. But I will describe some of the methods that we’ve used.

We investigated whether users who were similar to one another in one data set could be recovered by looking at their habits in the other data set. This is referred to as a Look-Alike Engine. We went about it by creating similarity matrices that assign each user pair a score based on how similar the two users are in the input data. For shopping habits that means comparing users who bought similar items. For the social network data that means comparing users who have ‘liked’ similar content.

There are a couple of different metrics one can use to determine similarity between two sets. A simple metric is the Jaccard index, defined as the number of items the two sets have in common divided by the total number of distinct items across both sets - the size of the intersection over the size of the union. We decided that this metric was not producing the results that we wanted. Depending on how the data sets are constructed and how the code treats them, a ‘0’ in one user’s data can be counted as matching a ‘0’ in another user’s data, so two users who each have missing data end up scored as similar. We might not want that in the end.
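As a rough illustration (a sketch in Python with made-up baskets, not our actual code), the Jaccard index for two users’ purchases might be computed like this:

```python
def jaccard_index(items_a, items_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 0.0  # treat two users with no data as not similar
    return len(a & b) / len(a | b)

# Hypothetical grocery baskets for two users
user_1 = {"milk", "bread", "eggs"}
user_2 = {"milk", "eggs", "coffee"}
print(jaccard_index(user_1, user_2))  # 2 shared items / 4 distinct items = 0.5
```

Working with explicit sets like this sidesteps the shared-zero issue, but once the data is laid out as binary vectors the implementation has to be careful about exactly that.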

Another metric is the ‘cosine’ similarity score. This requires one to vectorize the data in a consistent way, because the cosine metric is defined as the dot product of the two users’ data normalized by the product of the magnitudes. Recall that the dot product can be written A · B = |A| |B| cos(θ), so this method simply calculates the value of cos(θ). At first I found this concept slightly difficult to apply to “shopping habits” and ‘likes’ from Facebook. However, it is fairly straightforward once you organize the data into a vector.
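For two vectors this works out to just a few lines with numpy (a sketch, not our production code):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of magnitudes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / norm if norm else 0.0

print(cosine_similarity([1, 0, 1, 1], [1, 1, 0, 1]))  # 2 / 3 ≈ 0.67
```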

For example, with the movie ‘likes’ from Facebook, we created a vector for each user in which every element corresponds to a possible movie title. We populated the vector with a value of 1 if the user ‘liked’ the title, or a 0 if the user has not ‘liked’ that title. Then we had a vector spanning all possible dimensions (movie titles) for every user. The cosine score for a pair of users is then the normalized dot product of their vectors. For our data this meant that two users not declaring a ‘like’ for a given movie did not, by itself, make them similar - we don’t actually know whether they liked the movie or not, and not declaring a preference is not the same as disliking a thing.
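To make that concrete, here is a toy version of the vectorization with made-up movie titles (the real catalogue was of course much larger):

```python
import numpy as np

# Hypothetical catalogue of movie titles seen in the data
all_titles = ["Alien", "Brazil", "Casablanca", "Dune"]

def likes_to_vector(liked_titles, all_titles):
    """Binary vector: 1 if the user 'liked' the title, 0 otherwise."""
    return np.array([1 if t in liked_titles else 0 for t in all_titles])

u1 = likes_to_vector({"Alien", "Dune"}, all_titles)
u2 = likes_to_vector({"Alien", "Brazil"}, all_titles)

# Shared zeros add nothing to the dot product, so an undeclared
# preference never counts toward similarity.
print(np.dot(u1, u2) / (np.linalg.norm(u1) * np.linalg.norm(u2)))  # 0.5
```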

Once we measured the similarity between users with a given metric, we organized the data into a large similarity matrix. This matrix simply contained all the similarity scores for every user pair, and allowed us to find the users who are most similar to one another.
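With every user’s vector stacked into one matrix, the full set of pairwise scores can be built in one go, for example with scikit-learn (a sketch with toy data, assuming everything fits in memory):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are movie titles (hypothetical toy data)
likes = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])

sim = cosine_similarity(likes)   # 3x3 matrix of pairwise cosine scores
np.fill_diagonal(sim, 0)         # ignore each user's similarity to themselves
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(f"most similar pair: users {i} and {j} (score {sim[i, j]:.2f})")
```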

Finally, we could test whether or not users who were similar in one data set were also similar in the other. We selected the most similar users in one data set to use as a control sample, then selected the most similar users in the other data set and compared the two selections to see how many of the control users we recalled, and with what precision.
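In outline, the selection step looked something like this (a simplified sketch of the idea, not our actual pipeline):

```python
import numpy as np

def top_pairs(sim_matrix, n):
    """Return the n most similar user pairs (i < j) from a square similarity matrix."""
    iu = np.triu_indices_from(sim_matrix, k=1)    # upper triangle, no diagonal
    order = np.argsort(sim_matrix[iu])[::-1][:n]  # highest scores first
    return {(iu[0][k], iu[1][k]) for k in order}

# control_pairs = top_pairs(sim_shopping, 100)   # most similar pairs in the shopping data
# test_pairs    = top_pairs(sim_facebook, 100)   # most similar pairs in the Facebook data
# overlap       = control_pairs & test_pairs
```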

Recall and precision are two metrics by which we measured the ability of our method to find similar users. Recall is defined as the number of true positives in the set divided by the total number we could possibly find (the true positives plus false negatives). Precision is defined as the number of true positives divided by the number you’ve actually selected (true positives plus false positives). Depending on your application you might want to sacrifice some precision in order to obtain better recall, or vice versa.
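In code, with the control pairs treated as the ‘truth’ and the pairs chosen from the other data set as the selection, the two numbers are simply (continuing the sketch above):

```python
def recall_and_precision(selected, truth):
    """selected and truth are sets of user pairs."""
    true_positives = len(selected & truth)
    recall = true_positives / len(truth) if truth else 0.0            # TP / (TP + FN)
    precision = true_positives / len(selected) if selected else 0.0   # TP / (TP + FP)
    return recall, precision

# Example with hypothetical pair sets
truth = {(0, 1), (2, 3), (4, 5), (6, 7)}
selected = {(0, 1), (2, 3), (8, 9)}
print(recall_and_precision(selected, truth))  # (0.5, 0.667): recall 2/4, precision 2/3
```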

Lastly, in order to determine whether our method was actually any good, we conducted a simple Monte Carlo simulation in which we randomly selected users from the experimental data set to compare to the control set. In the end, we were able to determine the recall and precision for random selections and for selections made using the similarity scores, as a function of the number of users in the sample selection.
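The random baseline is conceptually simple; something along these lines (a hedged sketch with an arbitrary number of trials):

```python
import random

def random_baseline(all_pairs, truth, n_select, n_trials=1000):
    """Average recall and precision when pairs are picked at random rather than by similarity."""
    recalls, precisions = [], []
    for _ in range(n_trials):
        selected = set(random.sample(list(all_pairs), n_select))
        tp = len(selected & truth)
        recalls.append(tp / len(truth))
        precisions.append(tp / n_select)
    return sum(recalls) / n_trials, sum(precisions) / n_trials
```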

Now if there actually is a correlation between people’s shopping habits and the movies and TV shows that they like on Facebook, then we expect to more accurately select users from our similarity matrix than we would just by selecting at random. As I said, I won’t tell you the result, but you can hazard a guess on your own.

Coming up, we will present our results to the other participants and partner companies on Tuesday afternoon. Then on Wednesday afternoon there will be a job fair where we will be able to discuss opportunities with the partner companies.

As I said, I have found this experience very useful, and now I have a definite project that I can point to - with results - that shows what I can do as a data scientist. I am hoping that I will be able to develop more skills in data science and machine learning techniques as I move forward with my career.

Kyle D Hiner
