Month: September 2010

Dollhouse

I’ve been watching Dollhouse. Yep, behind the times, that’s me. There are three things about it that I’ve been really enjoying:

1) The way the Dolls move around the Dollhouse in their vacant state, and the way those movements and patterns change throughout the series.
2) The ambiguity of who is on whose side, particularly in the episode I just watched.
3) The way they’ve done interesting things with ambiguity without doing the whole who am I to judge, what is good and what is evil relativism thang.

Pop culture. Sometimes I like it!

the gauntlet is caught

further to this stuff, a ruby version:

class Array
  def group_seq
    reduce([]) do |acc, obj|
      if acc != [] && (yield obj) == (yield acc.last.first)
        acc.last << obj         
        acc                     
      else              
        acc << [obj]            
      end               
    end         
  end   
end

The (yield obj) == (yield acc.last.first) will mean that things get grouped according to the value of the predicate. If you change the == to &&, then only values for which the predicate is true will be grouped.
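To make the difference between the two comparisons concrete, here is a small Python sketch of the same fold. It is a loose translation rather than the Ruby above: the `same` parameter is my own invention and stands in for swapping == for &&.

```python
def group_seq(items, pred, same=lambda a, b: a == b):
    """Group consecutive items; `same` (a made-up parameter) decides whether
    an item's predicate value matches that of the current group's first item."""
    groups = []
    for item in items:
        if groups and same(pred(item), pred(groups[-1][0])):
            groups[-1].append(item)   # extend the current group
        else:
            groups.append([item])     # start a new group
    return groups

is_even = lambda n: n % 2 == 0
data = [1, 3, 2, 4, 5, 7]

# "==": runs are grouped by predicate value, true or false
by_value = group_seq(data, is_even)
# by_value == [[1, 3], [2, 4], [5, 7]]

# "&&": only runs where the predicate is true are grouped
only_true = group_seq(data, is_even, same=lambda a, b: a and b)
# only_true == [[1], [3], [2, 4], [5], [7]]
```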

It turns out that python already has this in its standard library, so that’s easy:

from itertools import groupby
[list(group) for _, group in groupby([1,2,3,4,5,6,7,8], lambda n: n % 2 == 0)]

Stabbing at mongo in the dark

I picked up MongoDB the other day. I think I got what I wanted from the exercise. What were the surprises, though?

1. In the JavaScript shell, you need the right combination of current database and collection before you see anything interesting. If db.record_baskets shows nothing even though you know there’s a record_baskets collection, you may be in the wrong database, as I was. Doing “use mydb” sorted things out.
2. When you save a record, you only get a replacement if you explicitly say what to replace. Otherwise you just add another record, and a search on your condition then returns two records. It took me a while to work this one out because my records were big-ol arrays, so it wasn’t entirely obvious that there were two of them.
3. There’s some crazy magic around index construction. Like, totally crazy. I did a search for {i: 1234}. It took 237 seconds. I did an ensureIndex for {i: 1}. It took half a second. Then repeating the search took no time at all. How come the linear scan took so long, but index construction so little? I have two theories. The first is that there’s some heuristic that mongo uses to decide when to make indexes, and it made an index in the process of issuing the first query, so ensureIndex didn’t actually have to do anything. The second is that the data storage format is such that all of the i-values were stored in an arrangement that was already usable as an index, so it was able to recognize that, mark it as an index, and use it as such without actually doing much rearranging.
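The second surprise is easier to see in a toy model. The sketch below is plain Python, not the MongoDB API: ToyCollection and its insert, replace and find methods are names I made up to mimic "add another record" versus "say what to replace".

```python
import itertools

class ToyCollection:
    """In-memory stand-in for a collection. Hypothetical, NOT the MongoDB API."""

    def __init__(self):
        self.docs = []
        self._ids = itertools.count(1)

    def _matches(self, doc, query):
        return all(doc.get(key) == value for key, value in query.items())

    def insert(self, doc):
        # Always adds a new document, even if a similar one already exists.
        stored = dict(doc, _id=next(self._ids))
        self.docs.append(stored)
        return stored["_id"]

    def replace(self, query, doc):
        # Replaces the first document matching `query`; inserts if none match.
        for pos, existing in enumerate(self.docs):
            if self._matches(existing, query):
                self.docs[pos] = dict(doc, _id=existing["_id"])
                return existing["_id"]
        return self.insert(doc)

    def find(self, query):
        return [doc for doc in self.docs if self._matches(doc, query)]

# Saving twice without saying what to replace leaves two records behind.
careless = ToyCollection()
careless.insert({"i": 1234, "records": [1, 2, 3]})
careless.insert({"i": 1234, "records": [1, 2, 3, 4]})
duplicates = len(careless.find({"i": 1234}))

# Naming the record to replace keeps a single copy.
careful = ToyCollection()
careful.insert({"i": 1234, "records": [1, 2, 3]})
careful.replace({"i": 1234}, {"i": 1234, "records": [1, 2, 3, 4]})
single = careful.find({"i": 1234})
```

With big arrays inside each record, the two documents in the first collection look like one at a glance, which is roughly the trap described in point 2.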

Also, the modifier operations are really neat and I want to do something that uses them to the fullest extent.

So, that was fun.

Tim throws the gauntlet

My friend Tim had a programming problem, described at his joint.

I thought about the problem for a while, started implementing it in ruby, thought “I could implement this better in haskell”, so I did:

group_seq :: (Int -> Bool) -> [Int] -> [[Int]]
group_seq pred [] = []
group_seq pred (x:xs)
  | pred x    = let (group, rest) = span pred (x:xs)
                in group:(group_seq pred rest)
  | otherwise = ([x]):(group_seq pred xs)

Then I implemented it in ruby:

class Array             
  def group_sequential
    result = [] 
    i = 0       
    while i < length
      group = [self[i]] 
      i += 1            
      if yield self[i-1]
        while i < length && (yield self[i])
          group << self[i]              
          i += 1                        
        end                     
      end               
      result << group   
    end         

    return result
  end   
end  

And I thought, that would be much nicer if we had an enumerator interface with a “get next” and a “current value” method. Ruby doesn’t give us these toys, as far as I can see, but we can simulate them for a small dataset by doing Array#dup, then “arr.shift” is “move to next and return last” and “arr.first” is “current value”:

class Array
  def group_sequential
    arr = self.dup
    result = [] 
    while arr.first
      group = [arr.first]
      if yield arr.shift
        group << arr.shift while arr.first && (yield arr.first)
      end               
      result << group   
    end         

    return result
  end   
end

And that one feels pretty good to me… It feels like it’d be nicer in a language with an iterator interface of the right sort. Hey, python has one of those, more or less, let’s try that.

def group_sequential(arr, pred):
    iterator = iter(arr)
    result = []
    try:
        value = next(iterator)
    except StopIteration:
        # empty input: nothing to group
        return result
    while True:
        group = [value]
        try:
            value = next(iterator)
            # extend the group only while its first item and the next one both match
            while pred(group[0]) and pred(value):
                group.append(value)
                value = next(iterator)
        except StopIteration:
            # ran out of input: flush the last group exactly once
            result.append(group)
            return result
        result.append(group)

And that one feels completely dreadful! Not being able to interrogate the iterator for the current value makes things a bit awkward.
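One way around the missing "current value" is to wrap the iterator in something that remembers it. The sketch below is my own: Peekable and its has_current, current and shift methods are invented names, chosen to mirror arr.first and arr.shift from the Ruby version, and the function gets a different name so it does not shadow the one above.

```python
class Peekable:
    """Wraps any iterable and keeps the current value around (hypothetical helper)."""

    _END = object()  # sentinel marking exhaustion

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._current = next(self._iterator, self._END)

    def has_current(self):
        return self._current is not self._END

    def current(self):
        # Like arr.first: look without consuming.
        return self._current

    def shift(self):
        # Like arr.shift: return the current value and move to the next one.
        value = self._current
        self._current = next(self._iterator, self._END)
        return value


def group_sequential_peek(arr, pred):
    items = Peekable(arr)
    result = []
    while items.has_current():
        first = items.shift()
        group = [first]
        if pred(first):
            while items.has_current() and pred(items.current()):
                group.append(items.shift())
        result.append(group)
    return result


grouped = group_sequential_peek([1, 2, 4, 5, 6, 7], lambda n: n % 2 == 0)
# grouped == [[1], [2, 4], [5], [6], [7]]
```

It costs a helper class, but the grouping loop ends up the same shape as the dup-and-shift Ruby, and it never copies the input.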

I’m not entirely happy with any of them. Wanting to go back to the drawing board, I translated the functional one fairly directly into Ruby, hoping to get beautiful code at the cost of weird performance characteristics:

class Array
  def group_sequential(&block)
    return [] if empty?
    return [self] if length == 1 
    rest = self[1..-1].group_sequential(&block)
    if (yield self.first) && (yield rest.first.first)
      rest.first.unshift(self.first)
    else        
      rest.unshift([self.first])
    end         
    rest        
  end   
end

Which is possibly the best yet, although it’s very strange-looking for ruby code. But I don’t like any of them, really.

In conclusion, I Don’t Know.

ports

Don’t you just love spending your evening removing and reinstalling all your ports, because some combination of OS changes and ports changes means you now need to look very carefully for 32/64 bit lovin?

And you get an error message that google doesn’t help with?

The error message was:
Error: Unable to open port: invalid command name “include”

It was proffering up this message immediately after trying to clean base64. Base64 wasn’t the problem, though.

Started by cd’ing into /opt/local/var/macports/sources/rsync.macports.org/release/ports. "sort < PortIndex.quick | less" told me that the package after base64 was bash. When I looked in shells/bash/Portfile, I saw a line starting with include. Commented that out, ran "sudo port clean --work --archive all", and it worked. No idea why it was a problem for me, though.

Well, actually it didn't work. It stopped again just before vim, so I "fixed" vim in the same way. When you do port clean --work --archive all, it goes all the way from the start, so I hacked this together to Not Do That:
sudo sh -c "sort < PortIndex.quick | cut -f 1 -d ' ' | grep -A 30000 '^vim$' | xargs port clean --work --archive"
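For anyone who doesn't read pipelines for fun, here is roughly what the sort, cut and grep part of that one-liner does, sketched in Python. The ports_from function and the sample lines are made up for illustration; the only thing taken from the command is that the port name is the first space-separated field on each line.

```python
def ports_from(index_lines, start):
    """Port names from `start` to the end of the sorted index.

    Mirrors: sort | cut -f 1 -d ' ' | grep -A 30000 '^start$'
    (without the 30000-line cap, and sorting by code point rather than locale).
    """
    names = [line.split(" ")[0] for line in sorted(index_lines) if line.strip()]
    if start not in names:
        return []  # grep would print nothing
    return names[names.index(start):]


# Invented sample lines; the second field is a placeholder value.
sample_index = ["vim 300", "bash 100", "zsh 400", "base64 50"]
remaining = ports_from(sample_index, "vim")
# remaining == ["vim", "zsh"]
```

The resulting names are what get handed to xargs, so the clean picks up at vim instead of starting from the top again.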

Lather. Rinse. Repeat. Cry. And check if your portfiles are still hacked up before you go installing things again.

(My guess as to cause: the portfile syntax used to have an include thing, but it's been deprecated. By running a new port command against an old portfile repository, I made bad things happen. That's just a guess)

Comparing MongoDB and PostgreSQL for a particular application

First, an admission: the comparison I did was neither thorough nor comprehensive, and provided only a slim reason to go one way or the other.

I was looking at a proposal for some work, and it specified that the system would have 300 million records, and some information about the likely structure of those records. Looking at that number and the likely structure, I had two questions:
1. Will we have to worry about how PostgreSQL goes with those sorts of numbers, spending time sharding and distributing and generally making life harder than it is with a single postgres instance?
2. Would MongoDB make our lives better?

The structure of the data is up to 30,000 buckets, with up to 10,000 records in each bucket. The obvious relational thing is to have a buckets table, and a records table, where the records table has a bucket_id and some other data. In this case, I used two decimals for the “some other data”, assuming that using more than just two decimals would change the constant factors but the basic performance characteristics would be the same. From what I know of the actual application, there’ll be another layer of indirection – something will have a bucket_id, and we’ll go from the something to the bucket to the records. Most operations will be either “append to bucket” or “get contents of bucket”, with the latter sometimes involving additional sorting/filtering.

My main interest is in looking ahead to a possible future where we are under deadline, and we have a performance problem, and we need the datastore to just do those append and get-contents operations as quickly as possible. If only one of them can be quick, I want it to be getting the contents of a bucket, because we can potentially queue appends.

So, I knocked up a little database schema for postgres, put appropriate indexes on it, and instructed the computer to invent 300,000,000 records, 10,000 for each of 30,000 buckets. An hour and a half later, I had them. I ran two tests, one for adding records to a bucket, and one for getting the contents of a bucket. They took 6 and 7 seconds of wall-clock time respectively (running the former 10,000 times and the latter 100 times).
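The relational layout is small enough to sketch. What follows is not the schema I actually ran: it uses Python's built-in sqlite3 as a stand-in for postgres so it runs anywhere, the table and column names (buckets, records, bucket_id, a, b) are assumptions, and it says nothing about timings at 300 million rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE buckets (id INTEGER PRIMARY KEY);
    CREATE TABLE records (
        id        INTEGER PRIMARY KEY,
        bucket_id INTEGER NOT NULL REFERENCES buckets(id),
        a         REAL NOT NULL,   -- the two "some other data" values
        b         REAL NOT NULL
    );
    -- the index that makes "get contents of bucket" a lookup rather than a scan
    CREATE INDEX records_bucket_id ON records (bucket_id);
""")

def append_to_bucket(conn, bucket_id, a, b):
    conn.execute(
        "INSERT INTO records (bucket_id, a, b) VALUES (?, ?, ?)",
        (bucket_id, a, b),
    )

def bucket_contents(conn, bucket_id):
    rows = conn.execute(
        "SELECT a, b FROM records WHERE bucket_id = ? ORDER BY id",
        (bucket_id,),
    )
    return rows.fetchall()

conn.execute("INSERT INTO buckets (id) VALUES (1)")
conn.execute("INSERT INTO buckets (id) VALUES (2)")
append_to_bucket(conn, 1, 1.5, 2.5)
append_to_bucket(conn, 2, 9.5, 9.5)
append_to_bucket(conn, 1, 3.5, 4.5)

contents = bucket_contents(conn, 1)
# contents == [(1.5, 2.5), (3.5, 4.5)]
```

Both operations touch only the rows for one bucket, which is why the index on bucket_id matters more than the total row count.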

Then I produced a similar-ish thing in mongo. I created a collection, and to that collection I added an entry for each bucket. Each entry contained an index integer and a list of records. Trying approximately the same operations, append-to-bucket and retrieve-bucket, yielded 10 and 24 wall-clock seconds respectively.
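For comparison, the document-per-bucket arrangement looks something like the following, written as plain Python dicts. The field names "i", "records", "a" and "b" are assumed, and apply_push is a toy of my own that imitates what MongoDB's $push update modifier does to one document; it is not driver code and not necessarily how the test did its appends.

```python
# One document per bucket: an index integer plus the list of records.
bucket_doc = {"i": 17, "records": [{"a": 1.5, "b": 2.5}]}

# append-to-bucket, expressed as a selector plus a $push update document
selector = {"i": 17}
update = {"$push": {"records": {"a": 3.5, "b": 4.5}}}

def apply_push(doc, update):
    """Toy, in-memory imitation of $push on a single document."""
    for field, value in update["$push"].items():
        doc.setdefault(field, []).append(value)
    return doc

# retrieve-bucket is then just "find the document and read its list"
if all(bucket_doc.get(key) == value for key, value in selector.items()):
    apply_push(bucket_doc, update)
retrieved = bucket_doc["records"]
# retrieved == [{"a": 1.5, "b": 2.5}, {"a": 3.5, "b": 4.5}]
```

With up to 10,000 records per bucket, each of these documents gets large, which is the document-size suspicion in a nutshell.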

Mongo and Postgres both coped fine with the volume of data. I suspect mongo’s relatively slow performance is a result of the document size, and am curious whether arranging the data into smaller documents would work better. It’s all a bit irrelevant, though, because we have lots of experience of doing rails apps with postgres, and it looks like that’ll work fine at this scale, at least as far as the database is concerned.

Hey, caveats! I compared the mongo driver to ActiveRecord::Base.connection.execute; I didn’t compare mongo mapper to activerecord. The mongo document arrangement I chose may well have been suboptimal.

So, what did I learn from the experiment?
1. We can recommend postgres and expect not to get burned by that choice.
2. While mongo is shiny and interesting, it’s not obviously better for this project than postgres.

Stubbing out OpenID: integration edition

This is an update to Stubbing out OAuth or OpenID, on how everything hangs together now.

I have a file “spec/stub_openid.rb”, which contains:

require 'openid'
require 'openid/extensions/ax'
require 'gapps_openid'
require 'ostruct' # OpenStruct is used by the stubs below

class OpenID::Consumer
  def begin(*args)
    o = Object.new
    class << o
      def add_extension(*args)
      end

      def redirect_url(*args)
        "/session/create"
      end
    end
    o
  end

  def complete(*args)
    OpenStruct.new(:status => OpenID::Consumer::SUCCESS)
  end
end

class OpenID::AX::FetchResponse
  def self.from_success_response(response)
    OpenStruct.new(:data => { "http://schema.openid.net/contact/email" => User.first.email })
  end
end

This has changed slightly from the previous version. We’re now doing some fancy meta stuff because the gapps_openid extension – required to make the openid library find the accounts hosted under Google Apps for Your Domain – changes the way the extension works a bit.

Because the library names are a bit nutty, we have these two gemfile lines:
gem "ruby-openid", "2.1.8", :require => ['openid', 'openid/extensions/ax']
gem "ruby-openid-apps-discovery", "1.2.0", :require => 'gapps_openid'

That stub_openid.rb file is ignored in production, and those two lines mean the library gets included properly. Thanks, bundler!

In order to make the library not be hit in the cucumber environment, our config/environments/cucumber.rb file requires ‘spec/stub_openid’.

And now it’s time for rainbows and unicorns.