Tech Archives, page 19

All things technology. Very popular here on Justinsomnia.

Load object by id + database call or by array?

I closed my post I heart foreach with the following question:

What I don’t know is how the performance savings of select * from table compares to the memory savings of select id from table + select * from table where id = $id.

Tonight I decided to answer that question. First I installed Xdebug. I already had Pear, so this did the trick:

sudo apt-get install php5-dev
sudo pecl install xdebug-beta

Then I added extension=xdebug.so to php.ini and restarted Apache. Thanks to Install xdebug for PHP5 on Ubuntu Feisty for the steps.

Then I had some fun. I downloaded datasets from Frequently Occurring First Names and Surnames From the 1990 Census, turned them into PHP arrays, and then generated a table with 100,000 records consisting of a primary key id, a random first name, and a random last name. Here’s a sample:

mysql> select * from names order by rand() limit 10;
+-------+------------+-----------+
| id    | first_name | last_name |
+-------+------------+-----------+
|  2535 | Lynelle    | Schmucker | 
| 25092 | Arnette    | Treglia   | 
| 92914 | Pandora    | Vonruden  | 
| 84120 | Nicholas   | Granados  | 
| 34593 | Roseline   | Salvitti  | 
| 40300 | Blondell   | Guerard   | 
| 50564 | Ivelisse   | Schuur    | 
| 15215 | Lynetta    | Naiman    | 
| 22300 | Antonia    | Lineback  | 
| 70832 | Jesse      | Agoras    | 
+-------+------------+-----------+

As a starting point, I wanted to test the “array of objects” design, and profile it using xdebug_time_index, xdebug_memory_usage, and xdebug_peak_memory_usage. The Xdebug documentation doesn’t really explain what those measure (seconds and bytes, I assume), but I figured at least it’d at least give me a way to judge relative performance and memory load.

So here was my reference code:

<?php
mysql_connect('localhost', 'root', '');
mysql_select_db('test');

class Name
{
  public $id;
  public $first_name;
  public $last_name;

  public function __construct($id = 0)
  {
    if ($id > 0) {
      $rst = mysql_query("select * from names where id = $id");
      $row = mysql_fetch_array($rst);
      $this->id         = $row['id'];
      $this->first_name = $row['first_name'];
      $this->last_name  = $row['last_name'];
    }
  }

  public static function bySQL($sql)
  {
    $rst = mysql_query($sql);
    $names = array();
    while ($row = mysql_fetch_array($rst)) {
      $names[] = new Name($row['id']);
    }
    return $names;
  }
}

And here was the test:

foreach (Name::bySQL("select id from names") as $name) {
  // do something
}

print "time: " . xdebug_time_index()        . "<br />";
print "pmem: " . xdebug_peak_memory_usage() . "<br />";
print "mem:  " . xdebug_memory_usage()      . "<br />";

The result was pretty scary. It took a long time to build and loop through all 100,000 objects:

time: 50.3534600735 (50-70 sec)
pmem: 39,267,200
mem: 3,205,784

Just for grins and giggles, I decided to make a modification to the constructor and bySQL functions, to pass along the complete array representing the database row, rather than having to go back and query the database again:

class Name
{
  public $id;
  public $first_name;
  public $last_name;

  public function __construct($id = 0)
  {
    if (is_array($id)) {
      $this->id         = $id['id'];
      $this->first_name = $id['first_name'];
      $this->last_name  = $id['last_name'];
    
    } elseif ($id > 0) {
      $rst = mysql_query("select * from names where id = $id");
      $row = mysql_fetch_array($rst);
      $this->id         = $row['id'];
      $this->first_name = $row['first_name'];
      $this->last_name  = $row['last_name'];
    }
  }

  public static function bySQL($sql)
  {
    $rst = mysql_query($sql);
    $names = array();
    while($row = mysql_fetch_array($rst)) {
      $names[] = new Name($row);
    }
    return $names;
  }
}

Here was the test this time (select * instead of id):

foreach (Name::bySQL("select * from names") as $name) {
   //do something
}

The result was astounding:

time: 8.96280193329 (6-9 secs)
pmem: 34,094,768
mem: 3,207,200

Peak memory usage dropped only slightly, but the script execution time dropped by almost an order of magnitude. Clearly most of the time was being spent executing those 100,000 additional selects (duh!), and there was no memory cost associated with doing an initial select *.

Next I tried the Collection class detailed in I heart foreach, minus the mysql_connect and mysql_select_db function calls. The test looked like this:

foreach (new Collection("Name", "select id from names") as $name) {
  // do something
}

Once again, performance was surprising. I was expecting an all around performance and memory usage improvement, but it turns out the Collection class actually ran slower:

time: 62.3732869625
pmem: 77,040
mem: 76,944

Of course memory-wise, the Collection object kills. From 39MB to 77KB! Granted this was still using the select id design, so I modified Collection::cacheNext() to pass along the whole array (not just the id value):

$this->value = new $this->class_name($row);

Using the second version of the Name class above, the test became:

foreach (new Collection("Name", "select * from names") as $name) {
  // do something
}

And finally I got the results I was looking for. Memory usage was less, and time dropped from 60 seconds to 13.

time: 12.9740481377
pmem: 75,960
mem: 66,872

Moral of the story: select * from table is much faster than select id from table + select * from table where id = $id and isn’t burdened by any sort of memory tax.

I heart foreach

My programming life is not very glamorous. Most of my time is spent in loops, usually working over arrays of objects that are loose wrappers around records stored in a database. Pretty much the bread and butter of all web applications.

Which is how I became fast friends with the foreach control structure in PHP:

$letters = array('a','b','c');

foreach ($letters as $letter) {
  print $letter;
}
//outputs: abc

The corresponding for-loop is ghastly by comparison:

for ($i = 0; $i < count($letters); $i++) {
  print $letters[$i];
}

Because of this, I’ve written a lot code that returns large arrays of objects, only to be iterated over using a foreach. The problem with this method is that each object has to be instantiated in advance and shoved into an array before any work can be done on it. The longer the array of objects, the more memory and time required.

What I needed was a way to use foreach to instantiate a new object at the beginning of each iteration—and then discard it at the end—so that at no point would more than one object exist in memory.

It was looking like I’d have to leave my precious foreach behind (for a while-loop) when I discovered that in PHP5, I can define a class that implements PHP’s internal Iterator interface—giving it the crucial methods that allow a foreach to iterate over an object (rewind, next, current, key, valid), giving me the power to decide when the individual objects in the collection are instantiated.

Meaning my code can continue to use the elegantly readable foreach, but instead of passing it an array of objects, I can pass it a custom Collection object with the ability to instantiate each child only when foreach requests it. Here’s the code:

<?php

class Collection implements Iterator
{
  private $class_name;
  private $rst;
  private $key = -1;
  private $value;
  private $length = false;

  public function __construct($class_name, $sql)
  {
    $this->class_name = $class_name;

    if (strtolower(substr(trim($sql), 0, 6)) == 'select') {
      // this is here for illustrative purposes
      // you probably want to wrap this in a DB class 
      $conn = @mysql_pconnect(DB_SERVER, DB_USER, DB_PASSWORD);
      @mysql_select_db(DB_NAME, $conn);
      $this->rst = @mysql_query($sql, $conn);
      
      $this->rewind();
    } else {
      // throw some kind of error
    }
  }

  public function rewind() 
  {
    if ($this->key != 0) {
      $this->key = 0;
      @mysql_data_seek($this->rst, 0);
      $this->cacheNext();
    }
  }

  private function cacheNext()
  {
    if ($row = mysql_fetch_assoc($this->rst)) {
      $this->value = new $this->class_name($row['id']);
    } else {
      $this->value = false;
    }
  }

  public function current() 
  {
    return $this->value;
  }

  public function key() 
  {
    return $this->key;
  }

  public function next()
  {
    $this->key++;
    $this->cacheNext();
    return $this->current();
  }

  public function valid() 
  {
    return $this->current() !== false;
  }

  public function length()
  {
    if ($this->length === false) {
      $this->length = mysql_num_rows($this->rst);      
    }
    return $this->length;
  }
}

?>

One possible improvement: Each constructor of our model classes accepts an id parameter which is used to load the rest of the fields for that record from the DB. But since the Collection object already has to execute a select query to get the id from the DB, it seems like it might as well grab the rest of the fields for that record at the same time—and then use them to instantiate a child object without an additional database select. What I don’t know is how the performance savings of select * from table compares to the memory savings of select id from table + select * from table where id = $id.

Update: Load object by id + database call or by array?

Sights of Maker Faire 2007

Maker Faire 2007: Bicycle carousel
Bicycle Carousel
Maker Faire 2007: Parent-powered carnival swings
Parent-powered Carnival Swings
Maker Faire 2007: Power Tool Dragsters
Power Tool Dragsters
Maker Faire 2007: Power Tool Drag Race, jump of doom!!!
Power Tool Drag Race: The Jump of DOOM!
Maker Faire 2007: Lego Model Railroading
Lego Model Railroading (the other LUG)
Maker Faire 2007: Cityscape artcar
Cityscape Artcar (one of many)
Maker Faire 2007: Thumb-sized solar car with circuit board flames
Thumb-sized solar car (complete with circuit board flames)

Sunday was only 82,800 seconds long

Yesterday I learned the hard way that there are not always 86,400 seconds in a day. It just so happens that yesterday on Sunday there were only 82,800 because yesterday Sunday was only 23 hours long. This is particularly tricky when you use a programming language that stores time as seconds since Jan 1, 1970, and you want to iterate over a date range.

You might want to do something harmless like:

function date_range($start_date, $end_date) {
  
  $current_date = strtotime($start_date);
  $end_date     = strtotime($end_date);

  while ($current_date <= $end_date) {
    $date_range[] = strftime("%Y-%m-%d", $current_date);  
    $current_date += 86400;
  }

  return $date_range;
}

But don’t. If your date range falls over daylight saving time (as mine did yesterday), you’ll be missing your end date. I fixed this particular bug by relying on the liberal input of strtotime:

function date_range($start_date, $end_date) {
  
  $current_date = strtotime($start_date);
  $end_date     = strtotime($end_date);

  while ($current_date <= $end_date) {
    $date_range[] = strftime("%Y-%m-%d", $current_date);
    $current_date = strtotime("+1 day", $current_date);
  }

  return $date_range;
}

So, word to the wise: wcgrep 86400

Update: Stephanie looked at my post and said, “Didn’t daylight saving time start on Sunday?” Umm, yes, yes it did.

Escalating the war on comment spam

Well it had to happen sooner or later. I guess. Someone finally managed to get through my spam defenses. They were really pretty basic. An experiment really. I had:

  1. renamed wp-comments-post.php
  2. html-entity encoded the comment form’s action attribute (the value being my renamed wp-comments-post.php)
  3. added a hidden input field to the form that only got written to the page via JavaScript
  4. updated the renamed wp-comments-post.php to check to make sure that that hidden field had been submitted and had a value

yahoo captchaI did all this because I really don’t like captchas (and because I could). I don’t mind requiring JavaScript in order to post a comment. It’s a pretty clever way to prove that a human is sitting at a fully featured web browser submitting the comment because bots don’t usually come with JavaScript parsing engines—the overhead is too great.

The reverse strategy, using a negative captcha (which I haven’t yet employed), involves adding an additional input field directly to the form and then hiding it with CSS. If it gets submitted with any content, you’ve caught yerself a spambot.

The first batch of 90 spam comments that came through (the largest I’d ever received at once) all had the same IP address, so it was easy to blacklist. The second batch all had different IPs (wow, a coordinated attack!) but all the spam linked to hometown.aol.com subsites. Easy block, but one that made me uneasy about the possibility of blocking legitimate content.

Then yesterday morning, the spam was coming in as I was checking my site (all linking to ca.geocities.com sites). Each time I reloaded there was something new. This provided me with the perfect spam laboratory. First, I renamed my already renamed wp-comments-post.php file, re-encoded the new name in HTML entities, and updated the comment form. It had no effect. I tailed my httpd access logs and the spammers instantly started POSTing to the new script. Which means they’re smart enough to regex the page they’re spidering for the form’s action attribute—and decode it.

In order to get past my hidden JavaScript captcha, I can only speculate that their regexes were just liberal enough to gather that they needed to POST something to my very unobfuscated hidden input field within a document.write statement. If I’d had more time, I would have written the full POST array to a database table to get some more insight into what exactly they were sending my way.

So I upped the ante by creating an external JavaScript file, that looks like this:

var commentform = document.getElementById('commentform');

if (commentform != null) {
  hidden_input = document.createElement('input');
  hidden_input.type  = 'hidden';
  hidden_input.name  = 'some_name';
  hidden_input.value = 'some_value'; 
  commentform.appendChild(hidden_input);
}

It gets called and executed after the browser parses my comment form, which creates a new input element in the document’s DOM, sets its type, name, and value attributes, and appends it to the form. I then updated my renamed wp-comments-post.php to check for the hidden field and its expected value.

Take that spambots. Have a nice weekend.