Speaking of which, I participated in Jubatus Hackathon #1 last October.
http://connpass.com/event/8233/
The presentation slides from the event are nicely summarized here, so please refer to that post.
http://blog.jubat.us/2014/10/jubatus.html
Since we went from planning to implementation in a single day, I regret that parts of it were rough, so I'll add a few notes here.
Shortly before the hackathon, I had been to one of the largest advertising events in Japan. Inspired by that, I proposed "let's build something ad-related!" to a junior colleague on the same team, got an OK, and we built it.
The specific specifications and assumptions were as follows:
- We have so-called owned media (articles) and want to display advertisements that match them.
- "Matching" means that the content of the article and the content of the advertisement are similar.
- (We didn't want to think too hard, since doing this after work is tough enough.)
The last condition in particular carried a very large weight.
To give a slightly more serious reason, I have a self-imposed rule to keep the structure as simple as possible whenever I'm unsure, because complex systems have higher maintenance costs. This is especially true with machine learning: if the system has a bug (or just unpleasant behavior), the cause could be any of the following:
- a bug in the middleware
- behavior that is actually correct for the algorithm
- a lack of training data
- inappropriate features
- insufficient tuning
There are that many possibilities to consider, so I'd rather not create more work for myself (seriously). Engineers automate things to make life easier in the first place, so if the work only increases, something has gone wrong (?).
By the way, the code probably won't run as-is, but it's on GitHub.
https://github.com/chase0213/jubatus-hackathon-01
The slides from my presentation are here:
http://www.slideshare.net/chisatohasegawa370/jubatus-hackathon-1hiyoshi
As you may have noticed from the GitHub repository, we initially aimed for a three-server configuration of Rails + Elasticsearch + Jubatus. However, the Elasticsearch part turned out not to be essential and was dropped. I did want to use it, though...
The request flow is as follows (a rough sketch of the Jubatus side follows the list):
- Owned-media articles are posted to the Rails server.
- Ads are provided in JSON format (including a title and an ad summary) and learned by Jubatus.
- Users enter search keywords to look up articles.
- The Rails server POSTs the content of the matching articles to the backend Jubatus (fronted by Python + Django).
- The Jubatus server receiving the request uses the recommender to return, to the Rails server, the IDs of advertorial articles whose content is close to the article's.
- The Rails server fetches the ads corresponding to the IDs received from the Jubatus server and displays them to the user.
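Here is a minimal sketch of what that Jubatus side could look like with the Python client's recommender API. The host, port, instance name, the single "text" field, and the sample ads are all assumptions for illustration; this is not the code from the repository.

```python
# Minimal sketch of the Jubatus side of the flow (assumptions, not the hackathon code).
# Assumes a jubarecommender process is running locally under the instance name "ads".
from jubatus.recommender.client import Recommender
from jubatus.common import Datum

client = Recommender("127.0.0.1", 9199, "ads", timeout=10)

# Learning step: register each ad (JSON with a title and a summary) as a row.
ads = [
    {"id": "ad-001", "title": "Cloud hosting sale", "summary": "Discounted VPS plans this month."},
    {"id": "ad-002", "title": "ML study group", "summary": "Hands-on machine learning meetup."},
]
for ad in ads:
    # Keep everything under the same datum key ("text") so that the converter
    # produces features comparable to the query datum below.
    client.update_row(ad["id"], Datum({"text": ad["title"] + " " + ad["summary"]}))

# Serving step: given the article content POSTed from Rails, return the closest ad IDs.
def similar_ad_ids(article_text, size=3):
    results = client.similar_row_from_datum(Datum({"text": article_text}), size)
    return [r.id for r in results]
```

The Django view would then just call something like `similar_ad_ids` with the POSTed article body and return the IDs as JSON to the Rails side.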
So in the end, we're just feeding everything into the recommender.
For judging the similarity between an article and an ad, the text is morphologically analyzed with MeCab + IPAdic and held as features via Jubatus's feature vector converter.
http://jubat.us/ja/fv_convert.html
This converter is good enough that we did almost no feature engineering of our own. As a result the features are full of noise and the accuracy isn't that high, but I'll come back to that later.
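For reference, a converter configuration along those lines might look like the following, written out as a Python dict and dumped to the recommender's config file. This is my reconstruction rather than the config from the repository; the recommender method, plugin path, and dictionary path are environment-dependent assumptions.

```python
# Sketch of a Jubatus config in which every string field is tokenized by the
# MeCab splitter plugin and weighted by tf-idf (assumed paths, adjust to your setup).
import json

config = {
    "method": "inverted_index",  # similarity search method used by the recommender
    "converter": {
        "string_filter_types": {}, "string_filter_rules": [],
        "num_filter_types": {}, "num_filter_rules": [],
        "string_types": {
            "mecab": {                          # dynamically loaded MeCab splitter plugin
                "method": "dynamic",
                "path": "libmecab_splitter.so",
                "function": "create",
                "arg": "-d /usr/lib/mecab/dic/ipadic",  # ipadic dictionary path (varies)
            }
        },
        "string_rules": [
            # tokenize every string key with MeCab; tf sample weight, idf global weight
            {"key": "*", "type": "mecab", "sample_weight": "tf", "global_weight": "idf"}
        ],
        "num_types": {}, "num_rules": [],
    },
}

with open("recommender_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Starting jubarecommender with a config file like this is all it takes for the features to be plain MeCab tokens; there is no hand-crafted feature engineering beyond what the converter does.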
We deployed it on AWS for the demo. It ran on a machine with fairly decent specs, so I've since shut it down (for financial reasons). With a bit of tweaking the code will run locally, so feel free to play with it.
As for the learning part, about the only "learning" was that ads could be added dynamically, so it didn't feel much like a real learner. Many people imagine machine learning as something that gets smarter every time it's used, so let me briefly comment on that part (I won't be writing any code for it, sorry).
First of all, when you use a learner in an advertising system, I think the point is to differentiate impression rates between what actually gets clicked and what doesn't. In other words, article/ad combinations that get clicked a lot are shown more often, and those that don't are shown only occasionally.
I haven't actually built this, but I think it can be done with a classifier. Specifically, take the union of the content of the article returned by the search system and the content of the clicked ad, and feed it to the classifier with a "clicked" label. Likewise, for combinations that weren't clicked, you would store the article/ad union with a "not clicked" label. One caveat here, though:
> Even if a link isn't clicked, it's premature to conclude that the user isn't interested in it,
> because in many cases the user was presented with many other options
> and did not actively indicate a lack of interest in that particular link.
I formatted that like a quote, but it's just my personal opinion. Flipping it around, clicked information should be weighted more strongly than non-clicked information.
So, if you tune this statistically and feed it to a binary classifier, you can predict whether a user is likely to click on a given combination. If you return that value to the upstream recommender and reflect it appropriately as a coefficient, the combinations that get clicked a lot will be shown more often.
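I said I wouldn't write code, but just to make the idea concrete, here is a rough sketch of that click-prediction step using the Jubatus Python client's classifier. The instance name, port, labels, and the crude trick of repeating clicked samples to weight them more heavily are all my own assumptions.

```python
# Rough sketch of click prediction with a binary classifier (not built at the hackathon).
# Clicked (article, ad) pairs are trained more heavily than non-clicked ones, reflecting
# the point above that a non-click is much weaker evidence than a click.
from jubatus.classifier.client import Classifier
from jubatus.classifier.types import LabeledDatum
from jubatus.common import Datum

client = Classifier("127.0.0.1", 9299, "ctr", timeout=10)

def pair_datum(article_text, ad_text):
    # Union of the article content and the ad content, under a single key so the
    # converter produces comparable features at training and prediction time.
    return Datum({"text": article_text + " " + ad_text})

def record_feedback(article_text, ad_text, clicked):
    d = pair_datum(article_text, ad_text)
    if clicked:
        # Crude weighting: feed clicked pairs several times.
        client.train([LabeledDatum("clicked", d)] * 3)
    else:
        client.train([LabeledDatum("not_clicked", d)])

def click_score(article_text, ad_text):
    # Score of the "clicked" label, usable as a coefficient when re-ranking
    # the recommender's candidates upstream.
    results = client.classify([pair_datum(article_text, ad_text)])[0]
    return {r.label: r.score for r in results}.get("clicked", 0.0)
```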
The last paragraph is armchair theory, so I'm sure there are pitfalls scattered all over it. Also, I described the features as "noisy" earlier, but some amount of noise should be tolerated: ads with absolutely no noise can actually be boring to users (I'm not an advertising professional, so I won't go too deep into that).