Thursday, June 26, 2008

WOW!!! That is all I can say.

So today I sent myself, from home, a script I wrote to check my rankings on google.com for my home site, ListThatAuto.com. Basically it takes my keyword list, turns each keyword into a Google search URL, uses sockets to connect, grabs the content, parses it to pull out the result links, checks whether each result link has the name of my site in it, and displays that link and its position number on the page if it does. Pretty simple and straightforward.
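A rough sketch of the parsing half of that flow, written in Python as a stand-in rather than the PHP of the actual script. The names search_url, LinkCollector and find_position are made up for illustration, and it assumes every absolute link on the page counts as a result, which a real Google results page would not guarantee.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def search_url(keyword):
    """Turn one keyword into a search URL (query format assumed)."""
    return "http://www.google.com/search?" + urlencode({"q": keyword})


class LinkCollector(HTMLParser):
    """Collects the href of every absolute <a> link, in page order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("http"):
                self.links.append(href)


def find_position(html, site):
    """Return (position, link) for the first link containing the
    site name, counting from 1, or None if the site is not listed."""
    collector = LinkCollector()
    collector.feed(html)
    for position, link in enumerate(collector.links, start=1):
        if site.lower() in link.lower():
            return position, link
    return None
```

Calling find_position(page_html, "listthatauto.com") on a fetched page would then give the position and link to display, or None when the site does not show up.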

Here was the problem. When running the script at home, it would time out, because each socket call to Google took about 1.1 seconds from start to finish to grab the content and process it. This unfortunately limited me to about 28 keywords per run. I knew that I could do better, so I did a few home-brewed load tests and was able to identify parts of my code that could use subtle but significant improvement. I made these changes one by one and managed to reduce the time to about 0.8 seconds per keyword. Still, I was capped at about 37-39 keywords. My personal site has only 41, so I was just a few away. After about another hour of load testing, I came to the conclusion that my code was as efficient as it would get.
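The arithmetic behind those caps can be sketched in a few lines. The 30-second limit here is an assumption (it is PHP's default max_execution_time, but the actual limit is not stated above), and the function names are made up for illustration.

```python
def keywords_per_run(time_limit_s, seconds_per_keyword):
    """How many keywords fit in one run when they are fetched one after another."""
    return int(time_limit_s // seconds_per_keyword)


def sequential_total(keyword_count, seconds_per_keyword):
    """Total time for a one-at-a-time run, rounded to a tenth of a second."""
    return round(keyword_count * seconds_per_keyword, 1)


ASSUMED_LIMIT_S = 30  # assumption: PHP's default max_execution_time

before = keywords_per_run(ASSUMED_LIMIT_S, 1.1)  # original script
after = keywords_per_run(ASSUMED_LIMIT_S, 0.8)   # after the tuning
full_run = sequential_total(41, 0.8)             # all 41 keywords in a row
```

With that assumed limit this works out to 27 and 37 keywords, close to the caps I actually saw, and a full run of all 41 keywords would need roughly 32.8 seconds, just over the limit.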

Though it was still rather inefficient, I sent it to work anyway. This is where the most amazing breakthrough took place. If you program in PHP, you have probably used sockets. Well, during my tests at home, I found that most of my time per keyword was spent just communicating with Google. I posed this to our dev team (which I am part of), and we brainstormed for a few minutes and developed a theory.

We believed that with PHP all a socket does is open a stream, read from the source, and store the data in a buffer for the script to access at any time. This means that we could essentially send the request immediately after opening the socket, then move on to the next connection and do the same until all the connections are made, and then come back and clean up by storing and processing the content as we grab it at our own pace.
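Here is a minimal sketch of that idea, in Python rather than the PHP we were working in. It is not the actual script: build_request and fetch_all are made-up names, the requests are plain HTTP/1.0, and it assumes each server closes the connection once it has sent its response.

```python
import socket


def build_request(host, path):
    """Build a bare HTTP/1.0 GET request for one page."""
    return (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def fetch_all(targets, timeout=10.0):
    """targets is a list of (host, port, path) tuples.
    Returns the response bodies in the same order."""
    socks = []
    # Phase 1: open every connection and send its request right away,
    # without waiting for any response to come back.
    for host, port, path in targets:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.sendall(build_request(host, path))
        socks.append(sock)

    # Phase 2: come back and drain each socket at our own pace. While we
    # read one, the others keep receiving data in the background.
    bodies = []
    for sock in socks:
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # empty read: the server closed the connection
                break
            chunks.append(chunk)
        sock.close()
        _headers, _sep, body = b"".join(chunks).partition(b"\r\n\r\n")
        bodies.append(body)
    return bodies
```

The waiting overlaps because every request is already on the wire before the first read starts, so the total time should be closer to the slowest single response than to the sum of all of them.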

I rewrote my class as a single recursive function to grab, process, and display the information described above. This new script, using the idea of opening each connection, requesting the data, and moving on to the next one before retrieving anything, does the entire process in about 3.5 seconds.

Pretty impressive, if you ask me.


loushou.
