Any database (postgres?) gurus? - Calgarypuck Forums - The Unofficial Calgary Flames Fan Community

photon · 03-17-2011, 12:30 PM

Lets say I have this query:

Code:

select count(cat.id) as cat_col0 
from cat 
inner join animal on cat.id=animal.id, house 
where house.id=10690 and 
(cat.id in (select a.id from house_animal, animal a where house.id=house_animal.house_id and house_animal.animal_id=a.id))

Basically there's cats, which are animals, and animals and houses have a many to many relationship.

So I want to find out how many cats are in a given house.

This code is being generated by a framework, so I don't have direct access to the house_animal table that stores the relationship between each animal and each house.

Now this query works, the strange thing is it takes FOREVER. Dozens of seconds or more on a quad core server with 12GB RAM. There's 11 million records in the house_animal table, but that shouldn't matter since there's only 10 animal records for a specific house, and there's an index on house.

The really weird thing is if I change the query manually to say:

Code:

select count(cat.id) as cat_col0 
from cat 
inner join animal on cat.id=animal.id, house 
where house.id=10690 and 
(cat.id in (select a.id from house_animal, animal a where 10690=house_animal.house_id and house_animal.animal_id=a.id))

Then it runs in 4ms.

Unfortunately I can't rewrite the query since it's auto generated, but if I understand why the two queries are different then I can possibly submit a bug report.

The long term fix for now is to take ownership of the many to many relationship myself and not depend on the underlying framework, but I would really like to know what the difference between the two queries is.

EDIT: I was going to post this to stackoverflow but thought it might be too simple or redundant, but it's hard to search for something this specific.

psyang · 03-17-2011, 03:19 PM

Each database will use different strategies when evaluating a query, so it's hard to know exactly why the query takes so long. But it's a safe bet that it is doing a full table scan.

Your original query is:

Code:

select count(cat.id) as cat_col0 
from cat 
inner join animal on cat.id=animal.id, house 
where house.id=10690 and 
(cat.id in 
  (select a.id from house_animal, animal a where
     house.id=house_animal.house_id and 
     house_animal.animal_id=a.id
  )
)

The subquery in the "in" clause is where your full table scan is occuring. The db is not optimizing the query by limiting the house_animal table to the rows with house_animal.house_id = 10690. As a result, the db is essentially executes a full inner join on the house_animal and animal tables (11M rows x # of rows in animal), then restricting that result by house_id=10690.

Why this would be happening is tougher to determine. Query optimizers try to take into account query structure and indices to determine the most effecient plan. Maybe house_animal.animal_id has an index, but house_animal.house_id doesn't, and so the optimizer evaluates the former clause first (inner loop) then filters based on the second clause (outer loop).

Or maybe the optimizer just doesn't like the way this query is structured. The last resort for a query optimizer is always a full table scan. The problem with auto-generated queries is that they will produce correct results, but not often effecient ones since they are always catering to the general case.

I have no experience with postgres, but this is fairly fundamental to most dbs. Hope this helps.

photon · 03-17-2011, 08:15 PM

Thanks, that makes sense. I thought it might be an actual mistake in the SQL (i.e. passing in the house.id into the subquery is forcing it to do a full table scan) rather than an optimization thing, but that makes more sense.

Yeah I knew going in that I would probably be spending some time hand tuning the queries once everything was working (kind of like making it right, then de-normalize where necessary). Though the framework I'm using is actually not too bad though, I can specify exact fields to add indexes with annotations right in the object and such.

In this case I don't even have to redo the query, I'll just manage the many to many relationship myself with a new domain object rather than letting the framework manage it itself.

Bobblehead · 03-18-2011, 10:41 AM

Something looks strange there - how many tables do you have?

I see cat, animal, house and animal_house. You are only using animal_house in the inner query to determine the cat_id, but when you move out one level you don't have any relation between house and cat or animal. Animal and cat are an inner join, but with no relation to house you are triggering a cartesian join. When you maually propagate the house_id inside the inner query it is probably still a cartesian, but since you have limited the result set it is really fast, and the optimizer isn't intelligent enough to propagate the value first.

I don't use postgres, but based on your description that is what it looks like to me.

photon · 03-18-2011, 11:07 AM

Interesting.

I do have only one side of my many to many defined in my domain (houses keep track of animals, but animals don't care about houses), I wonder if I added that relationship if it would change that query. Because otherwise that would seem like a pretty basic mistake that would get caught early on.

Unless everyone just manages their own many to many anyway so never actually get this code. Or they all use trivial cases.

Bobblehead · 03-18-2011, 11:13 AM

Quote:

Originally Posted by photon

Interesting.

I do have only one side of my many to many defined in my domain (houses keep track of animals, but animals don't care about houses), I wonder if I added that relationship if it would change that query. Because otherwise that would seem like a pretty basic mistake that would get caught early on.

Unless everyone just manages their own many to many anyway so never actually get this code. Or they all use trivial cases.

Often there will only ever be a trivial response and in those cases the fact that it is cartesian is irrelevant. Zx1 = Z.

But prettymuch anything that takes forever is some sort of non-optimized operation, either your correlation is creating an issue (which still might be the case here) or you are triggering a cartesian.

If you have any sore of analyzer you could see. I work with Oracle and explain plans can be a godsend to diagnose these things (even nicer since Oracle release SQL Developer a few years ago with an "explain plan" button)

Edit: I'm still having a bit of trouble seeing the relationships in your query.Too bad there is no ER diagram.

psyang · 03-18-2011, 11:31 AM

Quote:

Originally Posted by Bobblehead

Something looks strange there - how many tables do you have?

I see cat, animal, house and animal_house. You are only using animal_house in the inner query to determine the cat_id, but when you move out one level you don't have any relation between house and cat or animal. Animal and cat are an inner join, but with no relation to house you are triggering a cartesian join. When you maually propagate the house_id inside the inner query it is probably still a cartesian, but since you have limited the result set it is really fast, and the optimizer isn't intelligent enough to propagate the value first.

I don't use postgres, but based on your description that is what it looks like to me.

Yeah, you are right. But I'd be surprised if the cartesian join was causing the performance hit since there is a direct filter on house.id restricting house to a single row. If the same query format were used to find # of cats in, say, 10 houses, then you will get slower performance (though not noticeably), but also invalid results. Bottom line - it's a very poor query structure and even the fact that it is auto-generated isn't much of an excuse. I'd expect even an auto-generated query could come up with something more like:

Code:

select count(cat) from cat, animal, house_animal, house where
  cat.id = animal.id and
  animal.id = house_animal.animal_id and
  house_animal.house_id = house.id and
  house.id = 10690

photon · 03-18-2011, 11:34 AM

Sorry yeah I had to edit the query away from my real domain, NDA and all that.

Yeah I've been using explain and learning what PostgreSQL means with it (I'm used to MySQL), and I know much better now what their version of explain says since I asked this question so I'll have to go back and see what it says.

A cat is an animal, a house has one or more animals (and therefore one or more cats), an animal (and cat) has one or more houses. house_animal stores the relationship between them.

photon · 03-18-2011, 11:40 AM

Quote:

Originally Posted by psyang

Bottom line - it's a very poor query structure and even the fact that it is auto-generated isn't much of an excuse. I'd expect even an auto-generated query could come up with something more like:

Code:

select count(cat) from cat, animal, house_animal, house where
  cat.id = animal.id and
  animal.id = house_animal.animal_id and
  house_animal.house_id = house.id and
  house.id = 10690

That's probably because of the actual HQL query I have to use.

I'm using Hibernate, and because I'm only caring about the house-animal relationship from the house side of things, I had to generate a query in Hiberate like:

"SELECT count(c) from Cat c, House h WHERE c member of h.cats and h.id = :tid"

using HQL, Hibernate's query language, which refers to things at an object level, I don't have access to the house_animal table at the HQL level (unless I create my own relationship object HouseAnimal, which is what I will do).