Feeding PageRank

At one time my Google PageRank™ was quite high – 6 or 7, without my even trying. I guess I did something to annoy it over the years and my PR dropped to 0. So I decided to do a few things as a part of my new “Ego Drive”, not to be confused with the “Eco drive” that I still haven’t embarked upon. So far I have gotten around to doing just two:

  1. I registered with Technorati. And then I hit a kind of weird situation: Technorati wouldn’t let me claim http://mynethome.net/blog or http://blog.mynethome.net, both of which point to my blog. It instead let me claim my home page http://mynethome.net. I couldn’t figure out how the pinging on Technorati worked, though. How does Technorati understand that there is new content? I had a feeling that it has something to do with publishing feeds. And therein lay a problem – my feed lay on my blog, not on my homepage.
  2. That is where the second thing came in. I wrote my own feed generator. So far I have written it only for Atom 1.0 – I am yet to do it for RSS2. The feed generator posed some interesting challenges:
    1. How was I to manage the list of sites in a flexible manner for the feed to report against? I solved this using a simple XML-based sitemap. You can very easily write a simple XML document to represent your website. XML being structured content, I could use the same document to create a PHP sitemap.
    2. My sitemap is better for static content. What was the best way to show the dynamic content like that in my blog? After a bit of thinking and a bit more research I figured out the easiest way to handle this: a link to a feed in my sitemap. I put in a link for my blog saying:
      <feed>http://mynethome.net/blog/?feed=atom</feed>

      I then used the Zend Framework to get the entries in the feed. On the fly I pull up the entries whenever my site’s feed is accessed and show the results. For this I used the Zend_Feed class.

      I considered dynamically determining the information from the page’s <link> tags, but ran into some issues processing some feeds, when Zend kept throwing Exceptions.

    3. The third was to dynamically get the summary and content of all the pages in my sitemap, to display in my feed. For this I used the Simple HTML DOM Parser. That let me retrieve the details from a specific segment of my pages. In most of my pages the main content is stored in a div block with class content, so retrieving the details was pretty easy.
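
The three steps above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the PHP code behind this site: it stands in for Zend_Feed and the Simple HTML DOM Parser rather than reproducing them. Only the <feed> element comes from my actual sitemap; the <page> element, its title attribute and all the function names are hypothetical. Fetching pages and feeds over HTTP is left out, so every function simply takes the text it needs.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

ATOM = "{http://www.w3.org/2005/Atom}"


class ContentDivExtractor(HTMLParser):
    """Collects text found inside <div class="content"> blocks."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 = outside the content div, otherwise <div> nesting level
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "content" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_content(html):
    """Return the whitespace-normalised text of the page's content div."""
    parser = ContentDivExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def read_sitemap(xml_text):
    """Split a sitemap into static pages (title, url) and feed URLs.

    Assumed layout: <page title="...">url</page> and <feed>url</feed>.
    """
    root = ET.fromstring(xml_text.strip())
    pages = [(p.get("title"), p.text.strip()) for p in root.iter("page")]
    feeds = [f.text.strip() for f in root.iter("feed")]
    return pages, feeds


def read_atom_entries(atom_text):
    """Pull title, link, updated and summary out of an existing Atom feed."""
    root = ET.fromstring(atom_text.strip())
    entries = []
    for e in root.iter(ATOM + "entry"):
        link = e.find(ATOM + "link")
        entries.append({
            "title": e.findtext(ATOM + "title", ""),
            "url": link.get("href") if link is not None else "",
            "updated": e.findtext(ATOM + "updated", ""),
            "summary": e.findtext(ATOM + "summary", ""),
        })
    return entries


def build_atom(title, site_url, author, updated, entries):
    """Serialise entries as a minimal Atom 1.0 feed."""
    ET.register_namespace("", ATOM[1:-1])  # emit Atom as the default namespace
    feed = ET.Element(ATOM + "feed")
    ET.SubElement(feed, ATOM + "title").text = title
    ET.SubElement(feed, ATOM + "id").text = site_url
    ET.SubElement(feed, ATOM + "updated").text = updated
    ET.SubElement(ET.SubElement(feed, ATOM + "author"), ATOM + "name").text = author
    for item in entries:
        entry = ET.SubElement(feed, ATOM + "entry")
        ET.SubElement(entry, ATOM + "title").text = item["title"]
        ET.SubElement(entry, ATOM + "id").text = item["url"]
        ET.SubElement(entry, ATOM + "link", href=item["url"])
        ET.SubElement(entry, ATOM + "updated").text = item["updated"]
        ET.SubElement(entry, ATOM + "summary").text = item["summary"]
    return ET.tostring(feed, encoding="unicode")
```

The idea is that each static page from read_sitemap becomes one entry whose summary comes from extract_content, while each <feed> URL is downloaded and passed through read_atom_entries, and the combined list goes to build_atom. Using the page URL as the entry id is a simplification for the sketch.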

The next step would be to write an RSS2 feed, just to ensure compatibility with feed readers that can’t read Atom. I would be interested to know, though, if people who are savvy enough to use feed readers would settle for a feed reader that isn’t compatible with all kinds of feeds.
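
For what it is worth, an RSS 2.0 version would probably not need much more than a different serialiser. The following is a rough Python sketch under my own assumptions, not code from this site: build_rss2 and the shape of the entry dictionaries (title, url, updated, summary) are hypothetical, and the entries are assumed to carry Atom-style timestamps.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime


def build_rss2(title, site_url, description, entries):
    """Serialise entries as a minimal RSS 2.0 channel."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the three required channel elements.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = description
    for item in entries:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["url"]
        ET.SubElement(node, "guid").text = item["url"]
        # Atom timestamps are RFC 3339; RSS 2.0 expects RFC 822-style dates.
        when = datetime.fromisoformat(item["updated"].replace("Z", "+00:00"))
        ET.SubElement(node, "pubDate").text = format_datetime(when)
        ET.SubElement(node, "description").text = item["summary"]
    return ET.tostring(rss, encoding="unicode")
```

The main differences from Atom in this sketch are the date format and the absence of an XML namespace; the entries themselves stay the same.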

In the meanwhile I would like to welcome suggestions if you know how to do this better.

Googly for Google

Though the Google Watch site has been around for quite some time now, it was only recently that I came upon it. The site kicked quite a few thoughts into action, not because it is accurate, which it quite often is not, but because it set off the whole line of thinking around Eloi and Morlocks.

Google Watch’s Daniel Brandt talks about Gmail being creepy, how Google is a privacy hazard, how its caching is a copyright violation and how Google Print is potentially depriving authors of royalties. Naturally, given the popularity of Google, the site has a fairly large number of detractors and takers.

But the one thing they love more than a hero is to see a hero fail

Green Goblin to Spider-Man in Spider-Man

Well, Google is not a hero of the society, but it certainly is its darling. And while I come off as a person with a bias towards Google:

  1. I don’t work for Google.
  2. I don’t own their stock.
  3. I wish I could do at least one of the two above.
  4. I do believe that a lot of things that come out of Google’s stables could use an overhaul, like the accuracy of Google Maps and the polish of Google Talk.

Daniel Brandt’s crusade is laudable. He points out a lot of shortcomings of the PageRank algorithm, like how easy it is to fool, how “Back Links” often play truant when it comes to deciding the relevance of a page and so on. And these are indeed true. But I believe that in most other aspects Mr. Brandt gets it wrong.

One thing that piqued my interest when Gmail made its entry in the market was how and why people would kick up such a big fuss about targeted advertising. Most people aren’t averse to having spam filters and virus checkers on their email accounts. Yahoo, Microsoft and every other email provider worth its name provides these facilities. And these are automated tools that go through each of your mails before deciding if something has a virus or is from a suspect source. So why complain about Gmail, when all it is doing in addition to checking for junk in your mail is putting in some ads in the context of your original mail? In any case, the negative publicity seems to have paid off and most lay people shy away from opening Gmail accounts.

The privacy hazard accusation is another thing altogether. It is also something that I am least equipped to address, coming from a place where privacy is probably among the last of people’s concerns. Hence I will defer to Mr. Brandt’s comments regarding the easily available information about a person, though I must say that the only way cookies will cause you harm is if someone very adept at extracting information from cookies has access to the computer you use to access the internet, particularly to the folders where your browser stores the cookies for you. Google is not that villain.

That brings me to caching and Google Print. Google keeps obvious copies of pages in its cache. As of today so do Yahoo and MSN, but not Windows Live or Ask.com. By obvious what I mean is that it is difficult, if not impossible, to fetch results from the web fast without actually having a copy of the pages on an instance local to you. So even though Windows Live Search or Ask.com don’t display their cache to you, it doesn’t mean they don’t have one. They simply choose not to show it. Now, if you don’t want your page to be available and available fast for any web-search, why would you want a webpage at all? Having search engines throw up your page at the top of a heap is good advertisement for your site and caching certainly attains that goal.

As for Google Print, it aims at scanning all books in a library and making them available through a search. A search on Google Print, besides showing a few pages of the book, does nothing else. You certainly cannot read the entire book without paying the author a royalty. As I see it Google Print is one way of representing a library online. When you have a membership to a library you still get to read the books there, or at least excerpts from them, without having to buy them. Is that a copyright violation? Google, by bringing the library online, is, I believe, simplifying things to a great extent.

The entire policy of policing Google puts a curious spin on things. Sites like Google Watch would have us believe that Google has only one interest – information on our lives for its money. In this way an analogy can be drawn between Google and the Morlocks of our title – Google provides us information, while secretly harvesting information on us and setting us up for slaughter. Google Watch indeed says that people matter naught to Google.

Pointy Haired Boss: I’ve been saying for years that “Employees are our most valuable asset”… It turns out that I was wrong. Money is our most valuable asset. Employees are ninth.

Wally: I’m afraid to ask what came in eighth.

Pointy Haired Boss: Carbon Paper.

– Dilbert’s Still Pumped From Using the Mouse by Scott Adams

I am, however, not so sure. Google prides itself on good search results, but most people at Google itself consider it an advertising company. They also know that they are only going to be this rich as long as they manage to stay ahead of the curve with their core competency – search. The key rule of advertising is to not rub people the wrong way. When people find out that a company is scamming them, they boycott the company of their own volition. Google, by dishing up innovation through technology and not actually charging any of its users for anything, will manage to stay the people’s favourite for long, or at least until something better comes along.

Enough said. The intent of this page is not to sell itself by advertisement of Google or Google Watch: I have deliberately tried to avoid any “Search Engine Optimisation” here, apart from having written this in my blog. I do believe that not only does Mr. Brandt make a lot of good points, but also that quite a lot of his statements that I don’t agree with provide healthy food for thought. The issue here is not whether a company is getting fascinatingly rich, but whether the contributors to its wealth get anything in return. I believe I get a decent return out of using Google’s products (I also get a lot out of using Microsoft’s and Yahoo’s products) and I don’t believe my privacy is compromised. I consider myself satisfied.

My very first post after restarting my blog and I already seem to have violated the objective of being objective. But then, perhaps one of the most biased statements possible is, “This statement is unbiased”, by means of which the statement automatically biases itself in its favour.

I will sign off by pointing out a classic paper, “Reflections on Trusting Trust” by Ken Thompson. Whom will you trust?