Skip to content

fix: postgres search query does not use index#684

Merged
dluc merged 5 commits intomicrosoft:mainfrom
GenText:fix/postgresquery
Jul 9, 2024
Merged

fix: postgres search query does not use index#684
dluc merged 5 commits intomicrosoft:mainfrom
GenText:fix/postgresquery

Conversation

@roldengarm
Copy link
Contributor

Motivation and Context (Why the change? What's the scenario?)

I've imported about 6 million documents with PostgresDB as backend. I've created an HNSW index. When doing a search (using WebClient.Ask/Search) I keep getting timeouts. Upon further investigation, I saw that the query used in Kernel Memory is not optimisable as it's doing an inline calculation of "1 - the difference".

After I removed the "1 - " in our fork & deployed it, it was successfully using the index and I would get a response within 5 - 10 seconds.
I've used pgAdmin4 to confirm that the new query is using the index.

See also discussion with @dluc here: #663 (reply in thread)

@roldengarm roldengarm requested a review from dluc as a code owner July 3, 2024 23:31
@marcominerva
Copy link
Contributor

I think that, as this PR changes the returned similarityActualValue from PostgresSQL, also the minSimilarity usage must be adjusted.

In fact, going deep in the code, I see this:

sqlUserValues[similarityPlaceholder] = minSimilarity;

cmd.CommandText = @$"
SELECT {columns}, 1 - ({this._colEmbedding} <=> @embedding) AS {similarityActualValue}
FROM {tableName}
WHERE {filterSql}
ORDER BY {similarityActualValue} DESC
LIMIT @limit
OFFSET @offset
";
cmd.Parameters.AddWithValue("@embedding", target);
cmd.Parameters.AddWithValue("@limit", limit);
cmd.Parameters.AddWithValue("@offset", offset);
foreach (KeyValuePair<string, object> kv in sqlUserValues)
{
cmd.Parameters.AddWithValue(kv.Key, kv.Value);
}

In particular, the foreach adds sqlUserValues (included similarityPlaceholder= "@__min_similarity") to query parameters, but this parameter seems not to be used.

Instead, records are filtered by similarity after the query execution:

var run = true;
while (run && await dataReader.ReadAsync(cancellationToken).ConfigureAwait(false))
{
double similarity = dataReader.GetDouble(dataReader.GetOrdinal(similarityActualValue));
if (similarity < minSimilarity)
{
run = false;
continue;
}
result.Add((this.ReadEntry(dataReader, withEmbeddings), similarity));
}

@roldengarm
Copy link
Contributor Author

Thanks for checking @marcominerva I think you're right. Perhaps it worked at my end as minSimilarity was 0, haven't checked. But yes, I guess the check would have to be flipped.

But, wouldn't it be better & easier to read if the minSimilarity check would be part of the query? Then it could also be optimised by Postgres, instead of having an in-memory search / filtering afterwards. Would need to check filterSql a bit further, as in my test case it was empty / TRUE.

@marcominerva
Copy link
Contributor

marcominerva commented Jul 4, 2024

But, wouldn't it be better & easier to read if the minSimilarity check would be part of the query? Then it could also be optimised by Postgres, instead of having an in-memory search / filtering afterwards. Would need to check filterSql a bit further, as in my test case it was empty / TRUE.

Yes, it is exactly what I mean when I said that minSimilarity parameter is currently not used in the query.

@roldengarm
Copy link
Contributor Author

But, wouldn't it be better & easier to read if the minSimilarity check would be part of the query? Then it could also be optimised by Postgres, instead of having an in-memory search / filtering afterwards. Would need to check filterSql a bit further, as in my test case it was empty / TRUE.

Yes, it is exactly what I mean when I said that minSimilarity parameter is currently not used in the query.

Hi @marcominerva I've made the changes you recommended, the difference filter is now being done in the SQL query.
For some reason, I could not do "WHERE {colDifference} < {maxDifference}", that was causing a "table 'km-default'" can not be found.

The query still performs well, but on my large dataset I'm getting different results now. I'll investigate further as obviously the results should be similar. If you have a chance to review and see any obvious mistakes, that would be great.

@marcominerva
Copy link
Contributor

What you have called difference is actually the distance (the closer it is to 0, the closer the vectors themselves are). I suggest you to rename the code in this way to better specify the intent.

Moreover, Kernel Memory works with the concept of relevance (the closer it is to 1, more relevant the vectors are). All memories return records with the most relevance first. See for example:

IAsyncEnumerable<(MemoryRecord, double)> matches = this._memoryDb.GetSimilarListAsync(
index: index,
text: question,
filters: filters,
minRelevance: minRelevance,
limit: this._config.MaxMatchesCount,
withEmbeddings: false,
cancellationToken: cancellationToken);
// Memories are sorted by relevance, starting from the most relevant
await foreach ((MemoryRecord memory, double relevance) in matches.ConfigureAwait(false))
{

Instead, now in you're code the result list is ordered in an ascending way based on the distance (even if you calculate the similarity, the list is created reading the records that are returned by the query). So, I think you need to reverse the result list.

@roldengarm
Copy link
Contributor Author

What you have called difference is actually the distance (the closer it is to 0, the closer the vectors themselves are). I suggest you to rename the code in this way to better specify the intent.

Thanks, I've changed it to distance.

Instead, now in you're code the result list is ordered in an ascending way based on the distance (even if you calculate the similarity, the list is created reading the records that are returned by the query). So, I think you need to reverse the result list.

I have already reversed the ORDER BY from DESC (on similarity) to ASC (on distance), so I think that's correct. Upon further investigation it turns out the results were not different.
Can you please do a review and see if it can be merged? @marcominerva

@roldengarm
Copy link
Contributor Author

@microsoft-github-policy-service agree [company="GenText"]

@roldengarm
Copy link
Contributor Author

@microsoft-github-policy-service agree company="GenText"

@marcominerva
Copy link
Contributor

marcominerva commented Jul 8, 2024

I have already reversed the ORDER BY from DESC (on similarity) to ASC (on distance), so I think that's correct. Upon further investigation it turns out the results were not different. Can you please do a review and see if it can be merged? @marcominerva

I too have made a test with the old and the new approach and I have verified that now results are returned in the correct order. I would suggest only a last minor fix: rename the colDistance variable using __ as prefix, as in the original code:

string colDistance = "__distance";

Remember also to correct the comment at line 437 (you're still using colDifference).

After that, the code seems OK to me. Now we must wait for the approval by @dluc.

}
#pragma warning restore CA2100

this._log.LogTrace("SQL: {0}", cmd.CommandText);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this._log.LogTrace("SQL: {0}", cmd.CommandText);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks @dluc

@dluc
Copy link
Collaborator

dluc commented Jul 8, 2024

could you grant maintainers edit on the PR?

@dluc dluc added waiting for author Waiting for author to reply or address comments work in progress labels Jul 8, 2024
@roldengarm
Copy link
Contributor Author

could you grant maintainers edit on the PR?

Sorry @dluc , that option isn't there, I guess because it's not a user-owned fork, but an org owned? Unless I'm missing something, but I believe the option should appear here on the right at the bottom based on Github Docs.

{
filterSql = "TRUE";
}
var maxDistance = 1 - minSimilarity;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var maxDistance = 1 - minSimilarity;
var maxDistance = 1 - minSimilarity;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, codestyle: missing empty line

@dluc
Copy link
Collaborator

dluc commented Jul 9, 2024

could you grant maintainers edit on the PR?

Sorry @dluc , that option isn't there, I guess because it's not a user-owned fork, but an org owned? Unless I'm missing something, but I believe the option should appear here on the right at the bottom based on Github Docs.

please upvote https://github.com/orgs/community/discussions/5634

@dluc dluc merged commit 3515ab4 into microsoft:main Jul 9, 2024
dluc added a commit that referenced this pull request Jul 16, 2024
## Motivation and Context (Why the change? What's the scenario?)

During string interpolation if dotnet culture is set to French it used
comma as the decimal separator instead of period, which cased SQL error
"unexpected ,"

## High level description (Approach, Design)

Fix regression introduced in #684, using SQL parameters instead of string interpolation.

---------

Co-authored-by: Konstantine Kalbazov <konstantine.kalbazov@veepee.com>
Co-authored-by: Devis Lucato <dluc@users.noreply.github.com>
dluc pushed a commit that referenced this pull request May 8, 2025
## Motivation and Context (Why the change? What's the scenario?)

Since #684 The PostgresDbClient will fail to return results that match
the minSimilarity requirement when multiple filters are used. This is
due to how the ```WHERE``` clause is prepared: ```filter1 OR filter2 OR
filter3 AND embedding <=> @Embedding < @maxDistance``` which cannot work
as expected since the `AND` operator takes precedence over the `OR`
operator
## High level description (Approach, Design)
Simply add parenthesis around the filters argument
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

waiting for author Waiting for author to reply or address comments work in progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants