I have just finished an interesting project for a client. They run a network of news and analysis sites and have a custom search engine. One of the complaints about the search was that news and factual information were being missed because of a wealth of opinion and editorials. I was brought in to come up with a way of identifying whether an article was factual, opinion-driven or editorial (factual, but with an opinion), and to provide a ranking factor for the three. The idea was that a full-on opinionated rant should appear lower in the results than a factual news story.
Now, there is no such thing as an unopinionated article. Everyone will have a view, even if they try not to be biased, so the first and obvious question is: when should opinion count?
What Is an Opinion?
Well, a quick visit to dictionary.com turned up 2 out of 7 definitions for ‘opinion’ worth considering:
- a belief or judgment that rests on grounds insufficient to produce complete certainty.
- a personal view, attitude, or appraisal.
The words I think are important are judgment, view and appraisal. These lead us to right, wrong, negative and positive, so we could perhaps say an opinion, in the context of written prose, is something that is skewed towards negative or positive wording.
Tim is an utterly amazing and cool guy. I think everyone should follow him on twitter because he is awesome.
This can be safely assumed to be an opinion, as could:
Tim is a miserable, boring, depressing person no one should follow!
But it’s not long before Storm gets started:
“You can’t know anything,
Knowledge is merely opinion”
She opines, over her Cabernet Sauvignon
Vis a vis
Empirical comment by me
“Not a good start” I think….
I resist the urge to ask Storm
Whether knowledge is so loose-weave
Of a morning
When deciding whether to leave
Her apartment by the front door
Or a window on the second floor.
Tim Minchin – Storm
But what about:
Placing the L45 into Warp drive mode will cause it to fail resulting in a negative feedback in the pinky.
Here is something that, while negative words are used, could not be described as an opinion, assuming the warp drive does indeed fail in those circumstances. Still, if we assume an opinionated piece will have a higher concentration of negative or positive words per word count, we can create a simple rule for determining whether text is opinionated or not.
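As a rough illustration of that rule (not the code used on the project), here is a minimal Python sketch. The word lists, the function names and the 0.12 threshold are all made up for the example; a real lexicon would be far larger and the threshold would need tuning.

```python
import re

# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"amazing", "cool", "awesome"}
NEGATIVE = {"miserable", "boring", "depressing", "fail", "negative"}


def opinion_density(text):
    """Return the fraction of words in `text` found in either word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in POSITIVE or w in NEGATIVE)
    return hits / len(words)


def is_opinionated(text, threshold=0.12):
    """Flag text whose density of loaded words meets an assumed threshold."""
    return opinion_density(text) >= threshold


rant = "Tim is a miserable, boring, depressing person no one should follow!"
fact = ("Placing the L45 into Warp drive mode will cause it to fail "
        "resulting in a negative feedback in the pinky.")
print(is_opinionated(rant), is_opinionated(fact))
```

With these toy lists the rant is flagged and the warp drive sentence is not, because its two negative words are diluted by the other words in the sentence.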
How To Determine Positivity or Negativity
Let’s face it, certain words are positive and others are negative. As humans, we can work out which are which easily enough. Computers, however, need a little more help, so we have to provide them with a set of words, each identified as positive or negative. Then we simply process the text and work out, line by line, whether each line is positive, negative, or neutral. Simples.
Ok, so to do this, we could simply bucket count, but that’s not very efficient. Instead, we’ll use Bayes’ Theorem:
P(A|B) = P(B|A)*P(A) / P(B)
Where P is probability, A and B are events, and P(A|B) is the probability that A occurs given that B is true.
So, Bayes’ theorem calculates the probability that “A” is true or will be true, given a certain set of circumstances “B”. For more information, see Bayes Law in Plain English.
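To make the formula concrete, here is a small worked example in Python. The numbers are invented purely for illustration: suppose 40% of sentences are opinionated (A), the word "awesome" appears in 10% of opinionated sentences (B given A), and in 5% of all sentences (B).

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Made-up probabilities, for illustration only.
p_opinion_given_awesome = bayes(p_b_given_a=0.10, p_a=0.40, p_b=0.05)
print(f"{p_opinion_given_awesome:.2f}")  # prints 0.80
```

Under those assumed figures, a sentence containing "awesome" has an 80% chance of being opinionated.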
With this math under our belt, we can construct a set of Bayesian classifiers. These will sound familiar because they are what a lot of anti-spam filters use. In our case, they allow us to process the text and look for negative and positive words and phrases. With the classifier in place and a suitable set of positive and negative word lists, we can begin processing the documents.
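For readers who have not seen one, a naive Bayes text classifier can be sketched in a few lines of Python. This is not the project's code; the class name, the add-one (Laplace) smoothing and the four training sentences are assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing (a sketch)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def train(self, label, text):
        words = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, far too small for real use.
nb = NaiveBayes()
nb.train("positive", "an amazing and awesome guy")
nb.train("positive", "cool and awesome")
nb.train("negative", "a miserable boring person")
nb.train("negative", "boring and depressing")

print(nb.classify("he is awesome"))
print(nb.classify("a boring person"))
```

Working in log probabilities avoids the tiny numbers you get from multiplying many small probabilities together, and the smoothing stops an unseen word from zeroing out a whole class.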
While the client was only interested in opinions, we actually stored the total number of sentences with negative or positive opinions. The code for our initial version was heavily influenced by the Bayesian opinion mining code on PHPIR. However, we did end up rewriting the code in C++ as a PHP module to speed up the results for our large data sets. In addition, we took a similar approach to Darko Romanov and filtered the stop words from our sentences.
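Stop word filtering itself is simple. The Python sketch below shows the general idea; the stop word list and the function name are hypothetical and are not taken from the project or from Darko Romanov's code.

```python
import re

# Hypothetical stop word list; real lists run to a few hundred words.
STOP_WORDS = {"a", "an", "the", "is", "and", "i", "he", "him",
              "on", "because", "should"}


def filter_stop_words(sentence):
    """Return the sentence's words with common filler words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


print(filter_stop_words("Tim is an utterly amazing and cool guy."))
# prints ['tim', 'utterly', 'amazing', 'cool', 'guy']
```

Removing these words before classification means the classifier spends its effort on the words that actually carry sentiment.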
Once we had processed a document, sentence by sentence, we determined an overall score. This score was then used, along with the total number of sentences, to determine how opinionated a piece was. We also showed whether it had a positive or negative bias and compared it to 100 examples of what was deemed to be opinionated and 100 that were not.
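The sentence-by-sentence scoring could look something like the Python sketch below. It is only an illustration: the word-list lookup stands in for the Bayesian classifier, and the two measures (`opinion_ratio` and `bias`) are assumed formulas, since the exact scoring used on the project is not given here.

```python
import re

# Hypothetical word lists standing in for a trained classifier.
POSITIVE = {"amazing", "cool", "awesome"}
NEGATIVE = {"miserable", "boring", "depressing"}


def classify_sentence(sentence):
    """Label a sentence by whichever kind of loaded word dominates."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def score_document(text):
    """Split text into sentences and summarise how opinionated it is."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for s in sentences:
        counts[classify_sentence(s)] += 1
    total = len(sentences)
    if total == 0:
        return {"sentences": 0, "opinion_ratio": 0.0, "bias": 0.0}
    return {
        "sentences": total,
        # Share of sentences that carry any opinion at all.
        "opinion_ratio": (counts["positive"] + counts["negative"]) / total,
        # Above zero leans positive, below zero leans negative.
        "bias": (counts["positive"] - counts["negative"]) / total,
    }


doc = ("Tim is an amazing guy. He is awesome. He lives in London. "
       "The weather was miserable.")
print(score_document(doc))
```

For the sample text, three of the four sentences carry an opinion and the piece leans slightly positive.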
The system correctly identified all one hundred opinionated pieces, but also incorrectly flagged 12 of the non-opinionated pieces as opinionated. Tweaking the lexicon reduced this to 4, which was deemed a reasonable error margin. However, the tweaked lexicon also let 1 opinionated piece through. Again, this was deemed acceptable. The goal now is to provide multiple lexicons, depending on the site and the author who is writing the piece.
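Those counts translate into the usual classification measures. The short Python calculation below uses only the figures quoted above; the `metrics` helper is just a convenience for the example.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from a 2x2 confusion matrix."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Before tweaking: all 100 opinionated pieces caught, 12 false alarms.
before = metrics(tp=100, fp=12, fn=0, tn=88)
# After tweaking: 1 opinionated piece missed, 4 false alarms.
after = metrics(tp=99, fp=4, fn=1, tn=96)

for name, m in (("before", before), ("after", after)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Overall accuracy goes from 94% (188 of 200) to 97.5% (195 of 200), at the cost of recall dropping from 100% to 99%.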
Does It Cope With Sarcasm?
Surprisingly well! Most sarcasm uses a positive indicator when a negative is in fact implied. However, most sarcasm is also surrounded by other negative sentences, and the system breaks content down sentence by sentence. Thus, while the sarcastic sentence itself will indeed be misclassified, the surrounding sentences will not (hopefully), and so there would still be more negative than positive sentences within the piece.
What Other Uses in Search?
Well, obviously, in the original client request, they were looking at removing or reducing relevancy of opinionated pieces, but imagine if you did the opposite. Let’s say I run a review website that people can use to search for product reviews. Obviously, I link to products with affiliate links.
Now, imagine if my internal search was designed to show more positive results for items with a higher commission or conversion rate. Negative reviews would be found further down the list of product reviews. Sneaky, but if you have lots of people using your internal search, it could be one way to increase revenue.
Should It Be a Ranking Factor?
Well, I suspect many owners of product sites may wish it was a factor. Take, for example, my post on GoParks recently. The owner of the site would probably be quite keen if opinion pieces had reduced rankings. I can also see that this style of ranking could be useful for Google News or similar, but on the mainstream web I’m not convinced that it would work.
What do you think: should opinion be a ranking factor?