On Selling Used Cars With Postgres

22 October 2015

Let’s say we have an inventory of used vehicles and we’d like to display a selection of these vehicles to a user. It would be ideal if we showed the user a group varied by brand / model. Toyota makes a great car but It wouldn’t do to only display Toyota Camrys over and over.

It’s not difficult to imagine a car as a simple data object:

cars = [
  { make: "Toyota", model: "Camry", year: 1996, mileage: 201293, color: "Green" },
  { make: "Toyota", model: "Camry", year: 2011, mileage: 11920, color: "Black" },
  { make: "Toyota", model: "Highlander", year: 2009, mileage: 47560, color: "Silver" },
  { make: "Chevrolet", model: "HHR", year: 2007, mileage: 128743, color: "Sandstone Metallic" }
]

Ideally if we ask to display two cars we’d get a Toyota and a Chevrolet. If we ask for three cars we’d get a two different Toyotas (a Camry and a Highlander) and a Chevrolet.

It isn’t too hard to get these two tests passing:

def sample(count, list)
  makes = list.map { |vehicle| vehicle[:make]}.uniq.cycle

  list.select { |vehicle| vehicle[:make] == makes.next }.take(count)
end

describe 'sample' do

  it 'returns a good distribution of makes' do
    sample = sample(3, cars)
    makes  = sample.map { |vehicle| vehicle[:make] }
    expect(makes.sort).to eq(["Toyota", "Toyota", "Chevrolet"].sort)
  end

  it 'returns a good distributions of models' do
    sample  = sample(3, cars)
    models  = sample.map { |vehicle| vehicle[:model] }
    expect(models.sort).to eq(["Camry", "Highlander", "HHR"].sort)
  end

end

The bad news is they’re false positives. If we change the order of the cars the test for distribution of models fails.

We can modify the test to fail sometimes by shuffling the cars array on each test run, but I don’t particularly like that idea. There’s no kill like over kill, so we can test all possible permutations of the cars array:

  it 'returns a good distributions of models' do
    cars.permutation.to_a.each do |permutation|
      sample  = sample(3, permutation)
      models  = sample.map { |vehicle| vehicle[:model] }
      expect(models.sort).to eq(["Camry", "Highlander", "HHR"].sort)
    end
  end

This fails, but not in the way I expect. Interestingly, there is a particular order of cars where ALL the tests fail. I find this hilarious and points out a glaring flaw in my implementation.

If the ordering of cars by make is [ 'Toyota', 'Toyota', 'Chevrolet', 'Toyota' ] and the ordering of the makes enumerator is [ 'Toyota', 'Chevrolet', 'Toyota', 'Chevrolet' ] it is possible to return only ONE sample. We got lucky with select.

Maybe we should step back: what do we actually want to do? We want to take a car with a matching make from the makes enumerator (preferably one we haven’t seen before). This fixes the first test and gives an expected failure on the last test:

def sample(count, list)
  makes = list.map { |i| i[:make]}.uniq.cycle
  cars_by_make = list.group_by { |car| car[:make] }

  samples = [ ]

  count.times do
    possibility = cars_by_make[makes.next].shift
    redo unless possibility
    samples << possibility
  end

  samples
end

The test for Makes distribution is a bit more tricky. Initially I had thought we needed a product of the list of unique makes and unique models that we could use to choose values from cars:

def sample(count, list)
  makes  = list.map { |i| i[:make]}.uniq
  models = list.map { |i| i[:model]}.uniq

  pairings = makes.product(models).cycle

  samples = [ ]

  count.times do
    make, model = pairings.next
    possibility = list.find { |car| car[:make] == make && car[:model] == model }
    redo unless possibility
    samples << possibility
  end

  samples
end

This passes the tests, but if you’re paying attention you may have noticed that we generate ALL possible combinations of Make and Model. At best this is skipped over, since the pairings do not exist in the data. At worst it upsets some auto manufacturer’s legal departments with things like “Toyota HHR"s.

If you’re REALLY paying attention you may have noticed what we’re actually doing is re-implementing a self CROSS JOIN. We can net similar results with a relational database.

EDIT: The following only accidentally correct and somewhat silly. I think what we actually want here is WITH RECURSIVE but it probably warrants a post of its own.

cross_joining=# SELECT * FROM vehicles;                                                                                                                                                           make    |   model    | year | mileage |       color        
-----------+------------+------+---------+--------------------
 Toyota    | Camry      | 1996 |  201293 | Green
 Toyota    | Camry      | 2011 |   11920 | Black
 Chevrolet | HHR        | 2007 |  128743 | Sandstone Metallic
 Toyota    | Highlander | 2009 |   47560 | Silver


cross_joining=# SELECT DISTINCT make, model FROM vehicles WHERE
  (vehicles.make, vehicles.model) IN
     (SELECT make, model FROM
       (SELECT DISTINCT vehicles.make FROM vehicles) as j1
       CROSS JOIN
       (SELECT DISTINCT vehicles.model FROM vehicles) as j2);

   make    |   model
-----------+------------
 Toyota    | Camry
 Chevrolet | HHR
 Toyota    | Highlander
(3 rows)

We’re selecting the same varied group of makes and models, as in the Ruby example. Things tend to get a little odd as the table size grows. The query is quite performant, (even on tables over 100_000) but does not always give the results I’d prefer / expect.

Another weakness of this approach is we always get the same ordering of results. So if we wanted to show the user a different selection of vehicles on the next visit it can’t quite deliver. Perhaps in a later exercise I’ll try and find solutions to these problems.

Code from this post can be found here

Changes for this article can be found here

Notes:

This only SORT OF works, and need further improvements.
This is probably totally wrongheaded and my SQL is terrible.
I welcome corrections in the form of issues or PRs on this blog’s repo
I actually drive a Toyota Camry with ~200_000miles.

Special thanks to Nathan Long for proof reading, feedback, corrections, and being a generally awesome person.