In late February of 2011 Google made a major adjustment to the way its search engine ranks pages. Originally dubbed “Farmer” by the webmaster community, the update was renamed “Panda” by Google itself and has gone through three phases so far: Panda 1 in late February, Panda 2 in early April, and Panda 2.1 in early May.
The update is aimed at diminishing the reward for running a type of website known as a content farm. These sites generate hundreds of thousands of pages of content, typically via user contribution. The pages offer help on a wide variety of topics and are generally optimized, by the site itself or by its users, to rank higher in search engines. The websites, and sometimes the users, earn revenue from this content.
Many times the content is duplicated from other websites in a process known as ‘scraping’, or the content is ‘spun’, meaning some words are swapped for synonyms from a thesaurus. The first update hit a few well-known blogs, including Cult of Mac. While Google’s PR team was busy heralding the update as a commitment to quality, many webmasters and SEO experts were also cheering it on, since it largely pushed content farms out of the top rankings and allowed smaller sites to rank higher. The one website many thought should have been affected but wasn’t was eHow, a site founded by ex-employees of Google. Its parent company, Demand Media, had gone public just weeks earlier, raising millions and valuing the company at $1.5 billion.
When Panda 2 came out, the webmaster community turned bitter, with many claiming losses of 20-70% of their traffic from Google. Billions in ad revenue shifted from smaller websites to larger ones that are members of the Online Publishers Association; the OPA even claims it was a deal with Google that led to this change.
Google’s Panda 2.1 update seems to have affected few sites, if any. The rising tide of dissent has begun to drown out Google and the SEOs close to Google’s team, who have been fairly quiet or have simply republished Google’s PR material. To answer that rising tide, Google published a blog post in April written by one of the developers of the Panda update. The post only fueled more anger and speculation, which Google has so far been unable to put out.
Now that I’ve got you all brushed up, let’s dive into the updates. Panda 1 had a clear and apparent target: content farms. It seemed to work for the most part, hurting the rankings of content farms like EzineArticles and Suite101. A few webmasters grumbled that their own rankings dropped, and some theorized that links pointing to their sites from the affected domains lost link value, a theory that persists.
Here is how I assume Google may have developed the Panda 1 update. First, check whether a website has a LOT of content (amount = x). This is easy for Google, since they know how many of your pages they index.
Then categorize the content of those pages: for example, “is this an article on a subject” or “is this text describing something on the website”? Text on a site can be used to tell users about a value proposition offered by a company; on an ecommerce site, the phrase “read reviews from buyers” is something likely to be found, and it is not an article. I’m not sure how Google would do it, but as a layman myself I would count the number of characters in nearby blocks. That is to say, if there are 1,000 characters a few lines of code away from another 1,000 characters and it seems to be text, I would consider that two paragraphs. I’ve also heard suggestions that they could use semantically correct code: “this is in a P tag, it must be article-based content”, or “there are several P tags with content in a row, this must be an article”.
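As a rough sketch of that layman’s heuristic (entirely my own guess, not Google’s actual method), here is how counting the text inside P tags might look. The 1,000-character and two-block thresholds are the assumptions from above:

```python
from html.parser import HTMLParser

class ParagraphCounter(HTMLParser):
    """Collects the text length of each <p> block in a page."""
    def __init__(self):
        super().__init__()
        self.in_p = False     # are we currently inside a <p> tag?
        self.current = 0      # character count of the current block
        self.blocks = []      # character counts of all finished blocks

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.current = 0

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
            self.blocks.append(self.current)

    def handle_data(self, data):
        if self.in_p:
            self.current += len(data.strip())

def looks_like_article(html, min_block=1000, min_blocks=2):
    """Guess 'article' if several <p> tags each hold a large run of
    text, per the character-counting heuristic described above."""
    parser = ParagraphCounter()
    parser.feed(html)
    big_blocks = [b for b in parser.blocks if b >= min_block]
    return len(big_blocks) >= min_blocks
```

So a page with two 1,000+ character paragraphs would be treated as an article, while a snippet like “read reviews from buyers” would not.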
Once the algorithm determines the type of content (article versus other content), it can then perform analysis specific to that type of content. In fact, in Amit’s blog post (mentioned above) you can see specific language used in some bullet points about “articles” themselves, and in other bullets about the “content”, “page”, “ads”, and the “site”.
For example, in Panda 1 Google could easily say that 800 words, or an estimated 4,000 text characters, counts as an article. On that content you would then want to check spelling and grammar, check for author information, and check whether the article is duplicated elsewhere on the site or on the web.
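To make those checks concrete, here is a hedged sketch of the article-level tests I’m describing. The 800-word threshold comes from above; the word-shingle comparison is just one common way to detect near-duplicate text, not necessarily what Google does, and the 50% overlap cutoff is my own invention:

```python
import re

def word_count(text):
    return len(re.findall(r"[A-Za-z']+", text))

def shingles(text, k=5):
    """k-word shingles, lowercased: overlapping word windows that can
    be compared across documents to spot near-duplication."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def article_checks(text, other_pages, min_words=800, dup_threshold=0.5):
    """Run two of the checks described above on a candidate article:
    is it long enough, and does it heavily overlap another page?"""
    results = {"long_enough": word_count(text) >= min_words}
    s = shingles(text)
    duplicated = False
    for other in other_pages:
        o = shingles(other)
        if s and o:
            overlap = len(s & o) / min(len(s), len(o))
            if overlap >= dup_threshold:
                duplicated = True
                break
    results["duplicated"] = duplicated
    return results
```

A real system would also score spelling, grammar, and author information; those are much harder to sketch and I have left them out.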
The update could then take action if it determined the content on the domain was ‘low-quality’. Action could be a PageRank adjustment, deindexing, lowering the rank, or placing a penalty on the entire domain.
Panda 1, from what it seems, was probably easy to do. It’s definitely content-based and definitely targeted at English-language websites, which is why the update was restricted to U.S. search results the first time around.
Panda 2, according to Google, “went deeper into the longtail”, a cryptic description at best. Google claimed this update would affect only 2% of websites in the USA and would extend the algorithm to other English-speaking nations around the globe. But the update sent hundreds, if not thousands, of webmasters to Google’s support forums demanding to know why they had been labeled ‘low quality’. Many times there was no answer; other times Google staff would point to duplicate content issues.
Until Panda, duplicate content was always handled by Google, who sent EXPLICIT instructions to webmasters over the years, in their own videos, in appearances at conventions, and in interviews with SEOs, to ignore duplicate content and let Google make the adjustments on its end. With Panda that clearly changed, but is the change judiciously applied? The answer should be clear to anyone who knows Google’s business model. As Aaron Wall over at SeoBook points out, there are several examples where Google encourages duplicate content on its own websites and earns millions in profit from it. One that Aaron missed is Google Hosted News, which hosts news from the Associated Press. The AP is a great example of duplicated content being useful to users over the years.

In Google’s one blog post on how to avoid the Panda algorithm, they ask, “Does the article provide original content or information, original reporting, original research, or original analysis?” and “Does the article describe both sides of a story?” First off, the AP does not describe both sides of a story. Many webmasters and experts also believe that duplicate content across domains is part of the Panda updates; Google has not mentioned this specifically, but the OPA (referenced above) has. Today, if you search for ‘iceland’ on Google News, the top result is Google’s hosted AP article on the subject. Take a portion of that article and run it through Google’s own search engine and you will find 15,700 results, yet none of them outrank Google. Did Google do original research? Are they an expert on the subject of Iceland erupting? No, they are not, and here they are ranking #1 for an article essentially scraped from the AP’s website.
In my own research (sorry, I can’t release details due to an NDA), it appears that when it comes to cross-domain duplicate content, Google ranks one of the websites on page 1 and pushes the others carrying the same content as far back as page 6. Google itself stated that it is not deindexing the content, just adjusting how it ranks. Google is also trying to do source attribution, determining who originally wrote the article, but they are currently TERRIBLE at it. The page that ranks for the duplicated content is usually on a high-PageRank website. Why is this such a bad thing? Because there are numerous ‘innocent’ uses of duplicate content: product descriptions, real estate listings, author biographies, etc.
So, for example, if you sell the same stereo as Amazon.com or Crutchfield, you can’t use the manufacturer’s description: Google will attribute the description to their website and not realize that it’s a useful duplication that benefits consumers.
You can repeat the test I ran and see whether the results hold, to prove or disprove whether Panda is simply picking one website with the copied content and pushing the rest back:
- Search for a product
- Copy the description content into Copyscape or Google (your choice)
- Record the domains that use the content
- Search for the product again and record the ranking positions (while not logged in) for each website that uses the content. Also record the metrics of each domain (I used PageRank of the page, PageRank of the domain, and page authority as tracked by SEOmoz); you might also track direct links to the deep content.
- Then decide how difficult ranking on each term is, using your own definitions, and organize your results by those difficulty ratings (I used 1-10, where 1 equaled a made-up word and 10 equaled ‘google’).
- Now see what ranks in between the pages with the duplicate product descriptions.
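Once you have recorded your data by hand from the steps above, a small script can check for the pattern I observed. The data layout, the ten-results-per-page assumption, and the page-5 cutoff are all my own choices for illustration:

```python
def page_of(position, per_page=10):
    """Convert an absolute ranking position to a results-page number,
    assuming ten results per page."""
    return (position - 1) // per_page + 1

def duplicate_push_pattern(rankings, dup_domains):
    """rankings: {domain: recorded position} for one product query.
    dup_domains: domains you found sharing the same description.
    Returns the duplicate-content domains left on page 1 versus
    those pushed back to page 5 or later."""
    front_runners = [d for d in dup_domains
                     if d in rankings and page_of(rankings[d]) == 1]
    pushed_back = [d for d in dup_domains
                   if d in rankings and page_of(rankings[d]) >= 5]
    return front_runners, pushed_back
```

If the pattern from my research holds, you should repeatedly see exactly one domain in the first list and the rest in the second.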
So if you’re a smaller website and you’ve been clawed by Panda, what should you do? In my personal opinion, delete anything on your pages that isn’t unique or original. If you perform the test above, you might see that pages with just a title, a price, and some reviews, but no description, will rank for a product above pages that carry the duplicated description.
If you REALLY want to rank for something, sit down, write out your own description, and get links to that product page. Google: “Does the page provide substantial value when compared to other pages in search results?”
Build trust and show it: trust seals like TRUSTe and others might help, and could improve conversion rates as well. Google: “Would you be comfortable giving your credit card information to this site?”
And my final piece of advice is to leverage other search engines (Blekko, Bing), social media, and offline advertising campaigns. Do not put all of your business’s hopes in Google’s algorithm. This is just the beginning, and it appears it will only get more difficult for smaller websites with lots of pages to compete at scale with larger, deeper-pocketed companies.