fix: postgres search query does not use index#684
Conversation
|
I think that, as this PR changes the returned In fact, going deep in the code, I see this: kernel-memory/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs Lines 432 to 448 in 3d34260 In particular, the Instead, records are filtered by similarity after the query execution: kernel-memory/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs Lines 458 to 469 in 3d34260 |
|
Thanks for checking @marcominerva I think you're right. Perhaps it worked at my end as minSimilarity was 0, haven't checked. But yes, I guess the check would have to be flipped. But, wouldn't it be better & easier to read if the minSimilarity check would be part of the query? Then it could also be optimised by Postgres, instead of having an in-memory search / filtering afterwards. Would need to check |
Yes, it is exactly what I mean when I said that |
Hi @marcominerva I've made the changes you recommended, the difference filter is now being done in the SQL query. The query still performs well, but on my large dataset I'm getting different results now. I'll investigate further as obviously the results should be similar. If you have a chance to review and see any obvious mistakes, that would be great. |
|
What you have called Moreover, Kernel Memory works with the concept of relevance (the closer it is to 1, more relevant the vectors are). All memories return records with the most relevance first. See for example: kernel-memory/service/Core/Search/SearchClient.cs Lines 227 to 238 in 3d34260 Instead, now in you're code the result list is ordered in an ascending way based on the distance (even if you calculate the similarity, the list is created reading the records that are returned by the query). So, I think you need to reverse the result list. |
Thanks, I've changed it to distance.
I have already reversed the ORDER BY from DESC (on similarity) to ASC (on distance), so I think that's correct. Upon further investigation it turns out the results were not different. |
|
@microsoft-github-policy-service agree [company="GenText"] |
|
@microsoft-github-policy-service agree company="GenText" |
I too have made a test with the old and the new approach and I have verified that now results are returned in the correct order. I would suggest only a last minor fix: rename the
Remember also to correct the comment at line 437 (you're still using After that, the code seems OK to me. Now we must wait for the approval by @dluc. |
| } | ||
| #pragma warning restore CA2100 | ||
|
|
||
| this._log.LogTrace("SQL: {0}", cmd.CommandText); |
There was a problem hiding this comment.
| this._log.LogTrace("SQL: {0}", cmd.CommandText); |
|
could you grant maintainers edit on the PR? |
Sorry @dluc , that option isn't there, I guess because it's not a user-owned fork, but an org owned? Unless I'm missing something, but I believe the option should appear here on the right at the bottom based on Github Docs. |
| { | ||
| filterSql = "TRUE"; | ||
| } | ||
| var maxDistance = 1 - minSimilarity; |
There was a problem hiding this comment.
| var maxDistance = 1 - minSimilarity; | |
| var maxDistance = 1 - minSimilarity; |
There was a problem hiding this comment.
nit, codestyle: missing empty line
please upvote https://github.com/orgs/community/discussions/5634 |
## Motivation and Context (Why the change? What's the scenario?) During string interpolation if dotnet culture is set to French it used comma as the decimal separator instead of period, which cased SQL error "unexpected ," ## High level description (Approach, Design) Fix regression introduced in #684, using SQL parameters instead of string interpolation. --------- Co-authored-by: Konstantine Kalbazov <konstantine.kalbazov@veepee.com> Co-authored-by: Devis Lucato <dluc@users.noreply.github.com>
## Motivation and Context (Why the change? What's the scenario?) Since #684 The PostgresDbClient will fail to return results that match the minSimilarity requirement when multiple filters are used. This is due to how the ```WHERE``` clause is prepared: ```filter1 OR filter2 OR filter3 AND embedding <=> @Embedding < @maxDistance``` which cannot work as expected since the `AND` operator takes precedence over the `OR` operator ## High level description (Approach, Design) Simply add parenthesis around the filters argument
Motivation and Context (Why the change? What's the scenario?)
I've imported about 6 million documents with PostgresDB as backend. I've created an HNSW index. When doing a search (using WebClient.Ask/Search) I keep getting timeouts. Upon further investigation, I saw that the query used in Kernel Memory is not optimisable as it's doing an inline calculation of "1 - the difference".
After I removed the "1 - " in our fork & deployed it, it was successfully using the index and I would get a response within 5 - 10 seconds.
I've used pgAdmin4 to confirm that the new query is using the index.
See also discussion with @dluc here: #663 (reply in thread)