I have a custom query which calls entries and displays them based on either zipcode, category, or both. Currently, if an entry has two categories assigned to it, it will call that entry twice. I am not sure how to prevent this. Here is the current query.
The two embed variables being passed are coming from PHP post data.
{exp:query sql="
SELECT
t.title,
t.url_title AS permalink,
t.entry_id AS entry_id,
d.field_id_32 AS phone_foodbank,
d.field_id_46 AS crossstreets_foodbank,
d.field_id_33 AS hours_foodbank,
d.field_id_47 AS languages_foodbank,
d.field_id_36 AS eligibility_foodbank,
d.field_id_31 AS zipcode_foodbank
FROM
exp_weblog_data AS d,
exp_weblog_titles AS t,
exp_category_posts AS c
WHERE t.weblog_id = '14'
AND t.entry_id = d.entry_id
AND t.entry_id = c.entry_id
{if embed:listcat != ""}AND c.cat_id IN ({embed:listcat}) {/if}
{if embed:listzip != ""}AND d.field_id_31 = '{embed:listzip}' {/if}
"}Thanks!
[Mod Edit: Moved to the Development and Programming forum]
If I understand what you’re trying to do, a straightforward SELECT DISTINCT may work.
Though the link below talks about tables, I believe its actual definition is to return unique rows in whatever set your query results in.
In this case that result set is from the join which includes the IN (…) condition, which is what makes the possibility for duplicate rows.
To try it, just put the word DISTINCT after SELECT in your code.
Hello again, Aaren,
Well, the question you ask with DevDemon’s suggestion gets into interesting areas. It’s been a very long time since I have been near any of these, but let’s see what we can learn.
First, using GROUP BY without aggregates is apparently a unique ability of MySQL – and one that people question, as it isn’t ‘standards-based’ SQL. This doesn’t make it wrong, and I am interested in what DevDemon says about intermediate tables, but it does mean that using SELECT DISTINCT would be more portable if your site needed to work on another database type.
Cutting to the chase here, to see the results, it also turns out the SELECT DISTINCT will be more efficient, though at the level of data size likely with ExpressionEngine, this doesn’t practically matter. But it might be interesting to look into why, and how you would discover this – and how the results might have turned out differently.
Aaren, you may know all of what I would say here, but will write it anyway as I wanted to learn myself a little bit more about this contemporary database, and to leave a path.
To begin, we need to remember that SQL is a declarative language – it lets you say what results you want, but not how those results will be achieved. The query engine optimizer works out the ‘best’ way for the circumstances, and it can be arbitrarily capable and complex, and accurate as far as the special abilities of a given database implementation.
How you write the query does influence what the optimizer chooses, and there can even be special flags to influence it. The only way to learn what it’s doing is to get the optimizer to tell you its plan, and do that for each of your alternatives. You can also measure the performance if that’s important for the different plans, if that would not be evident – which it seldom is.
MySQL will give you its plan if you run a variant of your query prefixed by EXPLAIN, or EXPLAIN EXTENDED. You can do this from the Query Form of the SQL Manager in ExpressionEngine, and I’ll show you two results in the images. If you have phpMyAdmin access, you can also see the timings for about 20 aspects of how the query is actually run, and I did this too.
I set up a similar query to yours, but with ip_address as the non-unique field since I don’t have your custom fields. What I see is as follows, and you can look at the images to see for yourself as far as the query plans.
If a temporary table grows very large so that it can’t be held in memory, then it could be a problem, but isn’t necessarily any more so than a huge database table itself, given in either case that an index can be used. You see here the kinds of things optimizers can and do arrange, and none of it will be important at the scale of information size we would likely see in an ExpressionEngine scenario.
What actually is important is how the data is dealt with – mostly, whether an index can be used, with all its efficiencies, to choose the data that you need. In fact, the plan is telling us that the SELECT DISTINCT version is likely much more efficient, and certainly would be so on a large dataset. That’s because it uses an index for each of the matchings. The GROUP BY variation, on the other hand, uses a sort routine on the data in the temporary table, effectively building an index at the time of your execution, which is very expensive. If you look into it, you’ll find that ‘filesort’ doesn’t necessarily imply an external file, but the sort operation itself is just expensive. So on this basis as well as standards for portability, it looks like SELECT DISTINCT would be preferred.
I want to say here that there’s really no way to know the preference except by checking the optimizer plan, and also testing with full-size datasets or simulations for the real application. It’s a deep art, when the datasets are huge. But that is not our case here, and in fact when I ran the timings on an ExpressionEngine implementation with 100 posting on it, the two methods are so fast, and so close together, that you really can’t tell - it’s down to variations due to other loads on the database engine. Both queries returned in a total of around 150 microseconds, and even the sort operation isn’t particularly visible in the breakdown of where the query spends its time. That’s partially because the sort is being done on an integer field - it might be much more expensive if sorting had to be done on a string field, just for reference.
Ok, that’s probably about 100 times as much as most would want to know about this, but I kind of feel you’re a fellow who likes to understand, Aaren, and I appreciate that – it gains much. Also there’s enough to reference here that anyone finding the discussion would be able to look into it farther on the net for their own purposes.
Also, it’s a good example to show the kinds of things ExpressionEngine and its designers are doing for us, in all the normal places where we don’t have to write our own queries.
Thanks to DevDemon for an interesting point to chew on, also. The results could have turned out another way, and there might still be something I’m missing here, as always.
Regards, Clive
devdemon, thanks.
I did realize there is a detail I slipped over. In the SELECT DISTINCT query plan, the clearing of duplicate entries is done by a full table scan on the temporary table, since there won’t be any index on it. This is still much more efficient than sorting it as GROUP BY does, but it would be a factor if the temporary table needed to be quite large.
Just another way of saying you are right that queries which don’t cause duplicates are the way to go, when possible. I don’t see that it is for Aaren’s case, but again, there may be an alternate way to organize the problem.
I suspect SELECT DISTINCT is far efficient enough, though, for an ExpressionEngine situation – and knowing where to stop is important, no doubt we’d agree 😉.
Regards, Clive
Packet Tide owns and develops ExpressionEngine. © Packet Tide, All Rights Reserved.