I have just finished an interesting project for a client. They run a network of news and analysis sites and have a custom search engine. One of the complaints about the search was that news and factual information was being missed because of a wealth of opinion and editorials. I was brought in to try and come up with a way of identifying if an article was factual, opinion driven or editorial (factual that has an opinion), and to provide a ranking factor for the three. The idea was that a full-on opinionated rant should appear lower in the results than a factual news story.
Now, there is no such thing as an unopinionated article. Everyone will have a view, even if they try not to be biased, so the first and obvious question is: when should opinion count?
What Is an Opinion?
Well, a quick visit to dictionary.com turned up 2 out of 7 definitions for ‘opinion’ worth considering:
- a belief or judgment that rests on grounds insufficient to produce complete certainty.
- a personal view, attitude, or appraisal.
The words I think are important are judgment, view and appraisal. These lead us to right, wrong, negative and positive, so we could perhaps say an opinion, in the context of written prose, is something that is skewed towards negative or positive wording.
Tim is an utterly amazing and cool guy. I think everyone should follow him on twitter because he is awesome.
Can be safely assumed to be an opinion, as could
Tim is a miserable, boring, depressing person no one should follow!
But it’s not long before Storm gets started:
“You can’t know anything,
Knowledge is merely opinion”
She opines, over her Cabernet Sauvignon
Vis a vis
Some unhippily
Empirical comment by me
“Not a good start” I think….
I resist the urge to ask Storm
Whether knowledge is so loose-weave
Of a morning
When deciding whether to leave
Her apartment by the front door
Or a window on the second floor.
Tim Minchin – Storm
But what about:
Placing the L45 into Warp drive mode will cause it to fail resulting in a negative feedback in the pinky.
Here is something that, while negative words are used, could not be described as an opinion, assuming the warp drive does indeed fail in those circumstances. Still, if we assume an opinionated piece will have a higher concentration of negative or positive words per word count, we can create a simple rule for determining if text is opinionated or not.
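To make that rule concrete, here is a minimal sketch in Python. The word lists, the function names and the 0.12 threshold are all made up for illustration; a real lexicon would be far larger and the threshold would need tuning against real articles.

```python
import re

# Hypothetical, tiny word lists purely for illustration.
POSITIVE = {"amazing", "cool", "awesome"}
NEGATIVE = {"miserable", "boring", "depressing", "fail", "negative"}


def opinion_density(text):
    """Return the fraction of words found in either word list."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    loaded = sum(1 for w in words if w in POSITIVE or w in NEGATIVE)
    return loaded / len(words)


def looks_opinionated(text, threshold=0.12):
    """Flag text whose density of loaded words reaches the (assumed) threshold."""
    return opinion_density(text) >= threshold


praise = ("Tim is an utterly amazing and cool guy. I think everyone should "
          "follow him on twitter because he is awesome.")
manual = ("Placing the L45 into Warp drive mode will cause it to fail "
          "resulting in a negative feedback in the pinky.")

print(looks_opinionated(praise), looks_opinionated(manual))
```

With these toy lists the praise for Tim has 3 loaded words in 20, while the warp drive sentence has 2 in 20, so only the first one crosses the threshold. The margin is thin, which is exactly why a plain count is only a starting point.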
How To Determine Positivity or Negativity
Let’s face it, certain words are positive and others are negative. As humans, we can work out which are which easily enough. Computers, however, need a little more help, so we have to provide them with a set of words and identify if they are positive or negative. Then, we simply process through them and work out if the line is positive, negative, or neutral on a line by line basis. Simples.
Ok, so to do this, we could simply bucket count, but that’s not very efficient. Instead, we’ll use Bayes’ theorem:
P(A|B) = P(B|A)*P(A) / P(B)
Here P is probability, A and B are events, and P(A|B) is the probability that event A occurs given that B is true.
So, Bayes’ theorem calculates the probability that “A” is true or will be true, given a certain set of circumstances “B”. For more information, see Bayes Law in Plain English.
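A quick worked example may help. The numbers below are invented purely to show the arithmetic: suppose 30% of sentences are opinionated, the word “awesome” appears in 20% of opinionated sentences, and in 8% of all sentences.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Made-up probabilities, for illustration only.
p_opinion = 0.30                # P(A): a sentence is opinionated
p_awesome_given_opinion = 0.20  # P(B|A): "awesome" appears in an opinionated sentence
p_awesome = 0.08                # P(B): "awesome" appears in any sentence

p_opinion_given_awesome = bayes(p_awesome_given_opinion, p_opinion, p_awesome)
print(round(p_opinion_given_awesome, 2))
```

Under those assumed figures, seeing “awesome” lifts the chance that a sentence is opinionated from 30% to about 75%.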
With this math under our belt, we can construct a set of Bayesian classifiers. These will sound familiar because they are what a lot of anti-spam filters use. In our case, they allow us to process through and look for negative and positive words and phrases. With the classifier in place and a suitable set of positive and negative word lists, we can begin processing through the documents.
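For anyone who wants to see the shape of such a classifier, here is a generic naive Bayes sketch in Python. It is not the client code; the class name, the two labels and the training sentences are assumptions chosen for illustration, and it uses add-one smoothing so unseen words do not zero out a score.

```python
import math
import re
from collections import Counter


class NaiveBayesSentiment:
    """A toy two-class naive Bayes classifier over words (illustrative only)."""

    def __init__(self):
        self.word_counts = {"pos": Counter(), "neg": Counter()}
        self.doc_counts = Counter()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def classify(self, text):
        # Assumes each label has been trained at least once.
        vocab = set(self.word_counts["pos"]) | set(self.word_counts["neg"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Start from the log prior, then add log likelihoods per word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in self.tokenize(text):
                if word not in vocab:
                    continue  # ignore words never seen in training
                # Add-one (Laplace) smoothing.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


clf = NaiveBayesSentiment()
clf.train("amazing cool awesome guy", "pos")
clf.train("miserable boring depressing person", "neg")
print(clf.classify("he is awesome"), clf.classify("a boring talk"))
```

This sketch only ever answers positive or negative. A neutral verdict, as described above, would need an extra rule, for instance treating a sentence as neutral when the two scores are close.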
While the client was only interested in opinions, we actually stored the total number of sentences with negative or positive opinions. The code for our initial version was heavily influenced by the Bayesian opinion mining code on PHPIR. However, we did end up rewriting the code in C++ as a PHP module to speed up the results for our large data sets. In addition, we took a similar approach to Darko Romanov and filtered the stop words from our sentences.
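Stop word filtering is simple enough to sketch. The version below is in Python rather than the PHP and C++ we actually used, and the stop word list is a short made-up sample, not the one from the project.

```python
import re

# Hypothetical sample of stop words; real lists run to hundreds of entries.
STOP_WORDS = {"a", "an", "the", "is", "and", "i", "he", "him", "on",
              "because", "should"}


def remove_stop_words(sentence):
    """Lower-case the sentence, split it into words and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(remove_stop_words("Tim is an utterly amazing and cool guy."))
```

Dropping these filler words before classification means the classifier spends its effort on the words that actually carry a judgment.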
Once we had processed through the document, sentence by sentence, we determined an overall score. This score was then used, along with the total number of sentences, to determine how opinionated a piece was. We also showed if it had a positive or negative bias and compared it to 100 examples of what was deemed to be opinionated and 100 that were not.
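One way such a document score could be put together is sketched below. The exact formula we used is not shown here, so treat the two ratios, the function names and the keyword-based stand-in classifier as assumptions for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_classifier(sentence):
    """Stand-in for a real sentence classifier (hypothetical keyword lists)."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    if words & {"awesome", "amazing"}:
        return "pos"
    if words & {"boring", "miserable"}:
        return "neg"
    return "neutral"


def document_score(text, classify_sentence):
    """Summarise how opinionated a document is and which way it leans."""
    sentences = split_sentences(text)
    labels = [classify_sentence(s) for s in sentences]
    pos, neg, total = labels.count("pos"), labels.count("neg"), len(labels)
    if total == 0:
        return {"sentences": 0, "opinionated": 0.0, "bias": 0.0}
    return {
        "sentences": total,
        "opinionated": (pos + neg) / total,  # share of sentences with an opinion
        "bias": (pos - neg) / total,         # above 0 leans positive, below 0 negative
    }


article = ("Tim is awesome. He lives in London. His talks are boring. "
           "Everyone loves his amazing blog.")
score = document_score(article, toy_classifier)
print(score)
```

Keeping the sentence count alongside the ratios makes it possible to treat a two-sentence snippet differently from a long article with the same ratio.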
The system correctly identified the one hundred opinionated pieces, but also incorrectly flagged 12 of the non-opinionated pieces as opinionated. Tweaking the lexicon reduced this to 4, which was deemed a reasonable error margin. However, it also let 1 opinionated piece through. Again, this was deemed acceptable. The goal now is to provide multiple lexicons dependent on the site and the author who is writing the piece.
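Those counts translate into the usual accuracy, precision and recall figures. The small Python sketch below does the arithmetic, assuming the tweaked run had 4 false positives and 1 missed opinionated piece out of the same 200 test articles.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall for the 'opinionated' class."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Before tweaking: all 100 opinionated pieces caught, 12 false alarms.
before = metrics(tp=100, fp=12, tn=88, fn=0)
# After tweaking: 4 false alarms, 1 opinionated piece missed.
after = metrics(tp=99, fp=4, tn=96, fn=1)

for name, m in (("before", before), ("after", after)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

On these numbers accuracy moves from 94% to 97.5%, at the cost of recall dropping from 100% to 99%.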
Does It Cope With Sarcasm?
Surprisingly well! Most sarcasm uses a positive indicator when in fact a negative is implied. Because the system breaks content down sentence by sentence, and most sarcasm is surrounded by other negative sentences, the damage is limited. The sarcastic sentence itself will indeed be misclassified, but the surrounding sentences will not (hopefully), and so there would still be more negative than positive sentences within the piece.
What Other Uses in Search?
Well, obviously, in the original client request, they were looking at removing or reducing relevancy of opinionated pieces, but imagine if you did the opposite. Let’s say I run a review website that people can use to search for product reviews. Obviously, I link to products with affiliate links.
Now, imagine if my internal search was designed to show more positive results for higher commission or converting items? Negative reviews would be found further down the list of product reviews. Sneaky, but if you have lots of people using your internal search it could be one way to increase revenue.
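As a rough idea of how an opinion score might feed into ranking, here is a hypothetical Python sketch. The formula, the `weight` parameter and the sample results are all invented for illustration; this is not how the client's search engine combines its factors.

```python
def adjusted_score(relevance, opinion_score, weight=0.5):
    """Scale relevance down as the opinion score (0 to 1) rises.

    A negative weight would do the opposite and boost opinionated results.
    """
    return relevance * (1.0 - weight * opinion_score)


# Made-up search results with made-up scores.
results = [
    {"title": "Opinionated rant", "relevance": 0.9, "opinion": 0.8},
    {"title": "Factual news story", "relevance": 0.8, "opinion": 0.1},
]

ranked = sorted(
    results,
    key=lambda r: adjusted_score(r["relevance"], r["opinion"]),
    reverse=True,
)
print([r["title"] for r in ranked])
```

Here the rant starts with the higher raw relevance but ends up below the news story once the opinion penalty is applied. The review site trick would use the same mechanism with the positive or negative bias in place of the opinion score.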
Should It Be a Ranking Factor?
Well, I suspect many owners of product sites may wish it was a factor. Take, for example, my post on GoParks recently. The owner of the site would probably be quite keen if opinion pieces had reduced rankings. I can also see that for Google News or similar this style of ranking could be useful, but on the mainstream web I’m not convinced that it would work.
What do you think, should opinion be a ranking factor?
5 comments
It’s interesting that you’re proactively addressing the point, Tim, and I think you have your work cut out for you in discerning the difference between fact and opinion when (in my opinion ;P) the two are one and the same.
That means I’m on the side of Opinion outranking Fact I guess.
In my (somewhat insane and brightly coloured) little world there are no such things as facts. Nothing has such complete permanence in either truth or reality that it can be counted on to remain consistent for eternity and never change.
All Facts evolve and as such that means that no facts in actuality exist. They are simply the accepted opinion of that time or considered to be FACT because a majority may hold them as the basis for some other warped or misguided belief.
Therefore, to stick to the point, opinion is as much fact as accepted facts are, and only with a combination of them all can we decide which opinion is most acceptable or logical as a fact. To display this in search results and mark them up as one or the other is a misrepresentation, as this in itself is the “opinion” of the search engine’s algorithm in taking the factors and results into account!
I simply requote…
You can’t exactly argue with gravity. Well, you can briefly, until you hit the ground. Likewise there are always known constants, such as the speed of light in a vacuum. You could make the case that a fact is a proved opinion, but should I trust the opinion of a guy wearing feathers on his arms who declares gravity is no more and that he can fly as he leaps off said cliff?
Gravity is only a proven opinion, it’s true. What’s true here won’t make it fact in other places though, like on the moon, so while it exists in some form that doesn’t mean it is a constant “fact” just cause we experience it at this moment in this location.
If the guy wearing the feathers actually jumps then what happens next is that he is going to die. Unless there are other factors involved that I’m not aware of to prevent this fact. Like a parachute, or a bouncy castle or a tea cup of lukewarm water. Hell, maybe he will float upwards anyway cause he simply believes in it.
That doesn’t hold up. We know the effect of gravity not only here and on the moon but in any circumstances, given we have known constants (let’s call them, um, facts) about the environment and we have ways to measure those facts.
That we need to give the facts values we can share does not diminish them or make them any less concrete. We simply need to give them a name and a value to discuss them.
Given enough facts and data about the person falling off the cliff, we can give the exact force with which he will hit the ground.
I can say that opinion is a better factor for ranking. I agree with you, Tim.