Tip Jar Archives, page 4

Posts so helpful you might be compelled to tip me.

Search Engine Marketeers are the new script kiddies

Update: on March 12, 2008, Vadim Smelyansky emailed me to say that he was simply an innocent bystander doing freelance sysadmin work for an SEM company called LinksDealing, and that he has no knowledge of the internal workings of the system or how it’s being used (or misused). It’s conceivable that someone who managed to obtain the usernames and passwords to a large number of sites was then able to use that information with LinksDealing’s system to generate linkspam farms on a large scale.

Vadim says:

I just copy&paste database schema as is after day of playing with all possible MySQL parameters. I am not a SEO/SEM professional also I am not a DBA guru, but know enough about MySQL to support production servers.

The company registered in Israel. So I suppose that there can same named company in other countries. It is SE marketing, there is a link exchange server www.linksdealing.com and other SEM/SEO applications, some of them used to manage links on customers sites. Those sites can be banned without any notice from search engines. Probably it happens when somebody overdoing SEO on some customer site or experimenting. But there is no hackers or black hats. This database used by few of this applications. I have no idea of it complex logic.

I have no reason not to believe him, so I’ve removed the parts of this post that suggest he is the person who hacked my blog.

On August 8th, my blog, hosted at justinsomnia.org, disappeared from Google, completely, utterly without any warning or known provocation (e.g. black hat SEO), sending the traffic to my blog plummeting.

I complained through all the known, normal channels, which in my opinion are too few and far between. I checked Google’s Webmaster Central tools, which merely confirmed that my site no longer existed in their index. Frustrating.

Finally I emailed someone at Google that one of my co-workers knew. I felt bad doing this. There are millions of sites in Google. I shouldn’t have to email an individual directly for this kind of support. It just doesn’t scale. But alas. Yesterday morning I got a response. My contact at Google discovered that someone had actually hacked my site and was displaying search engine spam to search engine bots only!

Let me say that again. My blog was hacked! Ugh. So I have to admit I haven’t updated WordPress to the latest version, and I’m sure Gallery is not up to snuff either. What follows is a description of the hack and my eventually successful attempt to figure out who did this to me.

Here’s where I figure out what happened

Basically someone got access to my WordPress theme files. In footer.php the following line of code was added:

include('index2.php');

Then a file called index2.php was created that contained the following PHP code:

<?
$bots=array('ooglebot', 'yahoo', 'live', 'msn');
$y=0; for($i=0; $i<sizeof ($bots); $i++) if(strstr(strtolower($_SERVER["HTTP_USER_AGENT"]), strtolower($bots[$i]))) $y=1;
if($y){
  include('rq.txt');
}
?>

This means that if the user agent (e.g. web browser, search engine bot, feedreader, etc.) identified itself as a search engine crawler (Google, Yahoo, Live, or MSN) instead of Firefox or Internet Explorer, the file rq.txt would be included on the page. That file contained a list of 20 search engine spam links pointing to several compromised sites (whose owners I have notified), which in turn redirected you to the intended destination, in this case a supposed Canadian pharmaceutical e-commerce site, canadianmedsworld.com:

<a href=http://www.bluehighways.com/albums/buy-levitra.html>buy levitra</a><br>
<a href=http://www.uxmatters.com/scripts/viagra-online.html>viagra online</a><br>
...

To confirm this, I switched Firefox’s user agent to Googlebot’s, Googlebot/2.1 (+http://www.google.com/bot.html), using the User Agent Switcher extension, and sure enough, the spam links appeared on EVERY PAGE of my site!!! Quelle horreur! I felt so violated.
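The manual check above can also be scripted. What follows is a minimal sketch, not the tool I actually used: it assumes PHP 8 and that allow_url_fopen is enabled, and the function names (fetch_as, extract_links, links_only_for_bots) are made up for illustration. The idea is to fetch the same URL twice, once as a browser and once claiming to be Googlebot, and list any links that only the "bot" gets to see.

```php
<?php
// Fetch a page while claiming to be the given user agent.
// Returns the response body, or false if the request fails.
function fetch_as(string $url, string $user_agent): string|false
{
    $context = stream_context_create([
        'http' => ['user_agent' => $user_agent, 'timeout' => 10],
    ]);
    return file_get_contents($url, false, $context);
}

// Collect the href value of every <a> tag, quoted or not
// (the spam links in rq.txt used unquoted hrefs).
function extract_links(string $html): array
{
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)/i', $html, $matches);
    return array_values(array_unique($matches[1]));
}

// Links that appear only in the copy served to the "bot".
function links_only_for_bots(string $browser_html, string $bot_html): array
{
    return array_values(
        array_diff(extract_links($bot_html), extract_links($browser_html))
    );
}
```

To use it, call fetch_as() on the same URL with an ordinary browser user agent and with the Googlebot string, then pass both results to links_only_for_bots(). An empty array is reassuring but not proof: this only catches cloaking keyed on the user agent, as in my case, and a hack that checks the visitor's IP address would slip past it.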

Here’s where I figure out who did this

The timestamp on index2.php was Jul 3 13:35, which I believe was the initial date of the attack. The rq.txt file had been updated as recently as Aug 18 04:15. Then before my very eyes it was updated again yesterday, Aug 20 11:05, with even more spam links. I checked my http logs for both Aug 18 04:15 and Aug 20 11:05, but nothing looked out of the ordinary, just normal GET requests. Could my Dreamhost shell account have been compromised?—a fate even scarier than a WordPress bug.

So I started digging. Googling for the filenames created in the attack, I only found one other blog post describing the same symptoms in Spanish but without any really helpful information. My http logs don’t go back to July 3rd, but I have a JavaScript based request tracker which does. One minute after 13:35 on July 3 I found this very interesting request:

select * from request where request_id = 1857380\G
*************************** 1. row ***************************
        request_id: 1857380
       request_url: http://justinsomnia.org/
  request_referrer: http://fitis.google.com/rio/index.php?unit=adv_areas&sort_by=pr&sort_order=desc&page_n=1
      request_date: 2007-07-03 13:36:22
request_user_agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
        request_ip: 62.140.244.24

The most interesting detail is the request_user_agent value. But first some background. The reason for building a request tracker in JavaScript (rather than parsing server logs) is that search engine bots don’t execute JavaScript like a web browser does. So that serves as a reliable way to filter out automated requests from the human ones I’m interested in. This also means that the Googlebot user agent should NEVER appear in my stats. But sure enough, there it was, one minute after my blog’s theme had been hacked. Out of 1.9 million request records, only 70 ever identify as Googlebot (usually people who’ve changed their browser’s user agent for testing purposes). Then a day later, on July 4, my homepage was requested from the same IP (62.140.244.24), again with a user agent of “Googlebot”. What this means is that someone was manually checking my site in their web browser, masquerading as the Googlebot, to see if their hack had succeeded.

Now let’s take a look at the request_referrer value. That’s the URL of the webpage the person had in their browser when they clicked on a link pointing to http://justinsomnia.org/ (presumably out of a list of other hacked sites). First of all, http://fitis.google.com/ does not exist. That’s probably there to make the request look like it’s genuinely coming from Google. It’s very likely that they’d simply mapped that hostname to localhost in /etc/hosts. rio is presumably the name of an application for hacking sites and managing spam links. index.php is just the standard filename, and everything else is the query string. So I started Googling for a spamming application called “rio” or any occurrence of those query string variables in Google’s Code Search. Nada. Until I searched for the unassuming adv_areas value in Google proper, and struck veritable gold.
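For anyone who wants to pick a referrer apart the same way, here is a small sketch using PHP’s built-in parse_url and parse_str. The variable names are mine; the URL is the one from the request record above.

```php
<?php
$referrer = 'http://fitis.google.com/rio/index.php?unit=adv_areas&sort_by=pr&sort_order=desc&page_n=1';

// Split the URL into scheme, host, path and query string.
$parts = parse_url($referrer);

// Break the query string into its individual variables.
parse_str($parts['query'], $vars);

print $parts['host'] . "\n";   // the (nonexistent) hostname
print $parts['path'] . "\n";   // the application directory and script
foreach ($vars as $name => $value) {
    print "$name = $value\n";  // the values worth searching for
}
```

The query string variables turned out to be the useful part here, since a generic-looking value like adv_areas is exactly the sort of thing that ends up verbatim in a bug report or a forum post.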

There were only two results for that seemingly generic variable. The first was a MySQL bug report containing what appears to be a partial database schema for an SEO hacking/spamming engine:

adv_pages_free | CREATE TABLE `adv_pages_free` (
  `adv_page_id` int(11) NOT NULL default '0',
  `randomized` int(11) unsigned default NULL,
  PRIMARY KEY  (`adv_page_id`),
  KEY `randomized` (`randomized`,`adv_page_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 

adv_pages | CREATE TABLE `adv_pages` (
  `adv_page_id` int(11) NOT NULL auto_increment,
  `hostid` int(11) NOT NULL default '0',
  `uri` varchar(255) NOT NULL default '',
  `industry_id` smallint(4) NOT NULL default '0',
  `theme` varchar(255) default NULL,
  `filename` varchar(36) default NULL,
  `committed` timestamp NOT NULL default '0000-00-00 00:00:00',
  `commit_id` int(11) NOT NULL default '0',
  `nlinks` int(11) NOT NULL default '0',
  `keyword` text,
  PRIMARY KEY  (`adv_page_id`),
  UNIQUE KEY `uniq_page_id` (`hostid`,`uri`),
  KEY `page_id1` (`hostid`,`adv_page_id`,`uri`,`industry_id`)
) ENGINE=MyISAM AUTO_INCREMENT=21777537 DEFAULT CHARSET=latin1

adv_areas | CREATE TABLE `adv_areas` (
  `adv_page_id` int(11) NOT NULL default '0',
  `area_id` tinyint(4) NOT NULL default '1',
  `sentence_id` int(11) NOT NULL default '0',
  `anchor_text` varchar(255) NOT NULL default '',
  `promoted_id` int(11) NOT NULL default '0',
  `promoted_type` tinyint(1) NOT NULL default '2',
  `crawlMask` tinyint(4) NOT NULL default '0',
  UNIQUE KEY `uniq_area_id` (`adv_page_id`,`area_id`),
  KEY `promoted_id` (`promoted_type`,`promoted_id`),
  KEY `promoted_type_2` (`promoted_type`),
  KEY `promoted_type` (`adv_page_id`,`promoted_type`,`area_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

adv_hosts | CREATE TABLE `adv_hosts` (
  `hostid` int(11) NOT NULL auto_increment,
  `hostname` varchar(255) default NULL,
  `rev_hostname` varchar(255) default NULL,
  `port` smallint(6) NOT NULL default '80',
  `ip` varchar(50) default NULL,
  `classc` int(11) NOT NULL default '0',
  `oldip` varchar(50) default NULL,
  `link_industry_id` int(11) default '18',
  `g_known` tinyint(1) default '0',
  `y_known` tinyint(1) default '0',
  `m_known` tinyint(1) default '0',
  `g_banned` tinyint(1) default '0',
  `y_banned` tinyint(1) default '0',
  `m_banned` tinyint(1) default '0',
  `customized` tinyint(1) NOT NULL default '0',
  `modified` datetime default NULL,
  PRIMARY KEY  (`hostid`),
  UNIQUE KEY `hostname` (`hostname`,`port`),
  KEY `ip` (`ip`),
  KEY `iphostid` (`ip`,`hostid`),
  KEY `rev_host_name` (`rev_hostname`)
) ENGINE=MyISAM AUTO_INCREMENT=100404 DEFAULT CHARSET=latin1

You can interpret for yourself what you think the fields stand for, but it’s three in the adv_hosts table that stand out the most to me: g_banned, y_banned, and m_banned. What else do G, Y, and M stand for these days other than Google, Yahoo, and Microsoft? Fields like “theme”, “filename” and “nlinks” (number of links?) are also suspicious. Note the AUTO_INCREMENT value for the adv_hosts table: 100,404! The one for the adv_pages table is 21,777,537. From that one could infer that as of July 16, 2007 (when the bug was reported), this guy had already hacked over 100k sites, containing roughly 21.7 million defaced spam pages. Stunning. Later in the bug report he adds “Unfortunately I can not provide database content.” Yeah, I bet you can’t.

That bug report contained one other incredible piece of information: the name of the reporter, whose LinkedIn profile and resume describe him as a Software Engineer at “SEM Professionals,” but who says he was not responsible for hacking my blog.

SEM, for the uninitiated, usually stands for “Search Engine Marketing” which for some (e.g. black hats) entails spamming or gaming search engines into increasing the rank of certain search results for their clients, through any means necessary it seems (for more information, see Search engine marketing and Search engine optimization on Wikipedia). Suddenly my circumstantial evidence was looking a lot less circumstantial.

Update: Information removed on March 12, 2008 per note at top of post.

Here’s where I figure out how it was done

Actually I’m not 100% sure. Dreamhost does not believe my password was leaked last June when they experienced an FTP-related leak of 3500 passwords, though the time of the first intrusion (July 3) coincides with other bloggers who discovered their sites hacked (e.g. mezzoblue). If not Dreamhost then the next likely culprit would be an unknown vulnerability in PHP or WordPress. However cross referencing the timestamps of the hacked file updates with my http access log turned up nothing.

Finally this morning Dreamhost sent me justinsomnia.org’s ftp access logs for the last 6 days which contained the smoking gun. Remember the timestamps of the updated rq.txt? Aug 18 04:15 and Aug 20 11:05. Check out the timestamps of the two most recent entries at the top of the log:

jwatt    ftpd23817    201.27.197.215   Mon Aug 20 11:05 - 11:05  (00:00)
jwatt    ftpd10510    83.170.6.133     Sat Aug 18 04:15 - 04:15  (00:00)
jwatt    ftpd31925    125.163.255.120  Wed Aug 15 17:47 - 17:47  (00:00)
jwatt    ftpd19135    125.163.255.120  Wed Aug 15 16:54 - 16:55  (00:00)

At that point I disabled FTP (which I never use) and changed my passwords, and I will shortly begin updating my software. But first I had to post this.

Update: You made it all the way through this post? Well then you deserve a commemorative t-shirt: In Soviet Russia, blog hacks you!

How to prevent dropped ssh connections

In honor of Ken Snider (sysadmin at FM, though more famously the much lauded sysadmin for Boing Boing), who told me how to do this.

One of the biggest annoyances I experienced after moving over to Ubuntu was having my SSH sessions time out behind my home wireless router (aka NAT). Though at the time I didn’t realize that was the problem. I thought the server at the other end was “logging me out.”

What made things worse is that when the connection got dropped, GNOME Terminal would completely freeze, no Ctrl-C, no nothing. All I could do was open a new terminal tab and close that one. Mondo-annoying. Occasionally I’d return to a terminal that had been connected to another server via ssh only to find I’d been logged out with the cryptic message:

Write failed: Broken pipe

The same thing happened with GNOME’s network folders. If I left one open for too long and then tried to copy a file, it’d spin for a few minutes before reporting that it couldn’t connect, and only then would it allow me to abort. Hitting the cancel button while the copy was in progress did nothing. I could either wait and do other things, or force-quit.

What made this irritating coming from Windows, where I’d previously relied upon WinSCP and SecureCRT for my file-transfer and remote connection needs, was that I’d already figured out how to configure both those applications to periodically send what they called NO-OPs or null packets to trick the server (actually the NAT) into keeping the connection alive for as long as I had the application open.

But with a little command line utility like ssh (from OpenSSH), there’s no Edit > Preferences menu, so I figured I was stuck. This is where my GUI background shows. Of course ssh has a preferences menu! It’s called ssh_config, and it’s stored as a text file under /etc/ssh. Doh! Update: as HD mentions in the comments, it’s better to add this to your own local ssh config file: ~/.ssh/config.

And thanks to Ken I learned that by adding the following line to ssh_config, (not sshd_config, he warned), “ssh will send a message through the encrypted channel [every 120 seconds] to request a response from the server.”

ServerAliveInterval 120

In other words, no more prematurely dropped ssh connections!
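For the per-user route, a minimal ~/.ssh/config might look like the fragment below. The Host * pattern is my assumption here: it applies the setting to every server you connect to, and you can swap in a specific hostname if you only want it for one.

```
Host *
    ServerAliveInterval 120
```

Since this file lives in your home directory, it needs no root access to edit and does not affect other users on the machine.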

DIY portable chilled water

Brita On Tap Filtration System, $30
Rubbermaid 2.25qt covered pitcher, $3.41
Nalgene 16oz HDPE Narrow-Mouth bottle, $5

Brita Filter, Rubbermaid pitcher, Nalgene bottle

Total initial investment: $38.41. Yearly filter cost: ~$60. Drinking guilt-free bottled water: priceless.

So recently I read on Boing Boing:

There have been lots of stories lately about the inefficiency and environmental damages caused by bottled water…You’ll never want to drink Fiji again.

In the post was an excerpt from an article in Fast Company by Charles Fishman entitled Message in a Bottle (which Anton also mused on today in Water, water). This part caught my eye:

In fact, since the plastic for the bottles is shipped to Fiji first, the bottles’ journey is even longer. Half the wholesale cost of Fiji Water is transportation—which is to say, it costs as much to ship Fiji Water across the oceans and truck it to warehouses in the United States than it does to extract the water and bottle it.

Ouch. I could just see Fiji’s sales tanking thanks to Boing Boing’s amplification effect. I too felt my internal value system adjust to avoid bottled water more consciously in the future, though in actuality I purchase very little of it.

As it happens, I was at Trader Joe’s yesterday, and Stephanie picked up a bottle of Australian olive oil. I thought to myself, “Now wait, how is that any different? What about wine? What about anything that ends up in any grocery store?” Someone had to bring it there. Some amount of fossil fuels were burned in that process, whether via container ship, airplane, or truck. Why single out water?

Trader Joe’s sells a lot of water, but I instantly thought of the uber-environmentally and socially conscious Whole Foods. What about the gallons of water they sell? How do they justify it? Luckily I didn’t have to wait for an answer. Charles Fishman interviewed their CEO, John Mackey, in his article. Turns out his thoughts mirrored my own:

“It’s unfair to say bottled water is causing extra plastic in landfills, and it’s using energy transporting it,” he says. “There’s a substitution effect—it’s substituting for juices and Coke and Pepsi.” As for the energy used to transport water from overseas, Mackey says it is no more or less wasteful than the energy used to bring merlot from France or coffee from Ethiopia, raspberries from Chile or iPods from China. “Have we now decided that the use of any fossil fuel is somehow unethical?” Mackey asks. “I don’t think water should be picked on. Why is the iPod okay and the water is not?”

How to fix Eye of Gnome’s photo orientation in Ubuntu Feisty

After upgrading to Ubuntu Feisty Fawn, I discovered that Eye of Gnome (v2.18.1) no longer displayed vertical photos imported with Edgy Eft (or before) vertically. This seriously cramped my photo-reviewing style.

Eye of Gnome displaying a vertical photo imported with Edgy Eft horizontally

Curiously, Nautilus didn’t break a sweat:

Nautilus in Ubuntu Feisty Fawn displaying a thumbnail of a photo imported with Edgy Eft with the correct vertical orientation

Taking a look at Eye of Gnome’s preferences, I found an “Automatic orientation” option checked.

Seemed like that’s what I wanted. But I unchecked it just to see what would happen. Nothing. Until I opened a new instance of Eye of Gnome. Problem solved.

Eye of Gnome in Ubuntu Feisty Fawn (with Automatic Orientation off) displaying a vertical photo imported with Edgy Eft vertically

The good news is this fix doesn’t seem to affect images imported after upgrading to Feisty Fawn—which I’m assuming is due to the new “Rotate images physically” option in the photo importer.

Photo importer in Ubuntu Feisty Fawn showing the new Rotate images physically checkbox

I heart foreach

My programming life is not very glamorous. Most of my time is spent in loops, usually working over arrays of objects that are loose wrappers around records stored in a database. Pretty much the bread and butter of all web applications.

Which is how I became fast friends with the foreach control structure in PHP:

$letters = array('a','b','c');

foreach ($letters as $letter) {
  print $letter;
}
//outputs: abc

The corresponding for-loop is ghastly by comparison:

for ($i = 0; $i < count($letters); $i++) {
  print $letters[$i];
}

Because of this, I’ve written a lot of code that returns large arrays of objects, only to be iterated over using a foreach. The problem with this method is that each object has to be instantiated in advance and shoved into an array before any work can be done on it. The longer the array of objects, the more memory and time required.

What I needed was a way to use foreach to instantiate a new object at the beginning of each iteration—and then discard it at the end—so that at no point would more than one object exist in memory.

It was looking like I’d have to leave my precious foreach behind (for a while-loop) when I discovered that in PHP5, I can define a class that implements PHP’s internal Iterator interface. That gives the class the crucial methods that allow a foreach to iterate over an object (rewind, next, current, key, valid), and gives me the power to decide when the individual objects in the collection are instantiated.

Meaning my code can continue to use the elegantly readable foreach, but instead of passing it an array of objects, I can pass it a custom Collection object with the ability to instantiate each child only when foreach requests it. Here’s the code:

<?php

class Collection implements Iterator
{
  private $class_name;
  private $rst;
  private $key = -1;
  private $value;
  private $length = false;

  public function __construct($class_name, $sql)
  {
    $this->class_name = $class_name;

    if (strtolower(substr(trim($sql), 0, 6)) == 'select') {
      // this is here for illustrative purposes
      // you probably want to wrap this in a DB class 
      $conn = @mysql_pconnect(DB_SERVER, DB_USER, DB_PASSWORD);
      @mysql_select_db(DB_NAME, $conn);
      $this->rst = @mysql_query($sql, $conn);
      
      $this->rewind();
    } else {
      // throw some kind of error
    }
  }

  public function rewind() 
  {
    if ($this->key != 0) {
      $this->key = 0;
      @mysql_data_seek($this->rst, 0);
      $this->cacheNext();
    }
  }

  private function cacheNext()
  {
    if ($row = mysql_fetch_assoc($this->rst)) {
      $this->value = new $this->class_name($row['id']);
    } else {
      $this->value = false;
    }
  }

  public function current() 
  {
    return $this->value;
  }

  public function key() 
  {
    return $this->key;
  }

  public function next()
  {
    $this->key++;
    $this->cacheNext();
    return $this->current();
  }

  public function valid() 
  {
    return $this->current() !== false;
  }

  public function length()
  {
    if ($this->length === false) {
      $this->length = mysql_num_rows($this->rst);      
    }
    return $this->length;
  }
}

?>

One possible improvement: Each constructor of our model classes accepts an id parameter which is used to load the rest of the fields for that record from the DB. But since the Collection object already has to execute a select query to get the id from the DB, it seems like it might as well grab the rest of the fields for that record at the same time—and then use them to instantiate a child object without an additional database select. What I don’t know is how the performance savings of select * from table compares to the memory savings of select id from table + select * from table where id = $id.
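Here is a rough sketch of what the "load by array" variant could look like. It is not the code I run: it assumes PHP 8, the Letter model and the RowCollection class are hypothetical names, and a plain array of rows stands in for the database result set so the example runs on its own. The point is only that each child object is built from a row that has already been fetched, with no second select per record.

```php
<?php
// Hypothetical model class: the constructor accepts a full row
// instead of an id, so it never has to query the database itself.
class Letter
{
    public string $name;

    public function __construct(array $row)
    {
        $this->name = $row['name'];
    }
}

// Sketch of a collection that instantiates one child per iteration,
// straight from the row data it already has.
class RowCollection implements Iterator
{
    private string $class_name;
    private array $rows;   // stand-in for a database result set
    private int $key = 0;

    public function __construct(string $class_name, array $rows)
    {
        $this->class_name = $class_name;
        $this->rows = array_values($rows);
    }

    public function rewind(): void { $this->key = 0; }
    public function valid(): bool { return $this->key < count($this->rows); }
    public function key(): mixed { return $this->key; }
    public function next(): void { $this->key++; }

    // The object is created only when foreach asks for it.
    public function current(): mixed
    {
        return new $this->class_name($this->rows[$this->key]);
    }
}

$letters = new RowCollection('Letter', [
    ['name' => 'a'], ['name' => 'b'], ['name' => 'c'],
]);

foreach ($letters as $letter) {
    print $letter->name;
}
// outputs: abc
```

In a real version the rows would come from the result set one fetch at a time, as in the Collection class above, rather than from an array held in memory. Whether that beats the id-only query is still the open question.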

Update: Load object by id + database call or by array?