fix: postgres search query does not use index by roldengarm · Pull Request #684 · microsoft/kernel-memory

roldengarm · 2024-07-03T23:31:54Z

Motivation and Context (Why the change? What's the scenario?)

I've imported about 6 million documents with Postgres as the backend and created an HNSW index. When doing a search (using WebClient.Ask/Search), I keep getting timeouts. Upon further investigation, I saw that the query used in Kernel Memory cannot be optimised because it does an inline calculation of "1 - the difference".

After I removed the "1 - " in our fork and deployed it, the query used the index and I got a response within 5 to 10 seconds.
I've used pgAdmin4 to confirm that the new query is using the index.

See also discussion with @dluc here: #663 (reply in thread)
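For context, pgvector's documentation states that an approximate index such as HNSW is only considered when the query has a LIMIT and orders by the result of a distance operator itself, in ascending order, rather than by an expression wrapped around it. The sketch below contrasts the two query shapes. The table name `memories`, the column name `embedding`, and the class and method names are hypothetical placeholders, not the identifiers Kernel Memory generates.

```csharp
using System;

public static class QueryShapes
{
    // Hypothetical names, for illustration only.
    private const string Table = "memories";
    private const string Column = "embedding";

    // Orders by an expression wrapped around the distance operator,
    // in descending order. This shape is not eligible for the index.
    public static string BySimilarity() =>
        $"SELECT *, 1 - ({Column} <=> @embedding) AS __similarity " +
        $"FROM {Table} ORDER BY __similarity DESC LIMIT @limit";

    // Orders by the bare distance operator, ascending, with a LIMIT.
    public static string ByDistance() =>
        $"SELECT *, {Column} <=> @embedding AS __distance " +
        $"FROM {Table} ORDER BY __distance ASC LIMIT @limit";

    public static void Main()
    {
        Console.WriteLine(BySimilarity());
        Console.WriteLine(ByDistance());
    }
}
```

Running EXPLAIN on both statements against a real table is the reliable way to confirm which plan Postgres picks, as was done with pgAdmin4 above.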

marcominerva · 2024-07-04T07:23:48Z

I think that, since this PR changes the similarityActualValue returned from PostgreSQL, the minSimilarity usage must be adjusted as well.

In fact, digging into the code, I see this:

kernel-memory/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs

Line 419 in 3d34260

sqlUserValues[similarityPlaceholder] = minSimilarity;

kernel-memory/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs

Lines 432 to 448 in 3d34260

    
    cmd.CommandText = @$"
        SELECT {columns}, 1 - ({this._colEmbedding} <=> @embedding) AS {similarityActualValue}
        FROM {tableName}
        WHERE {filterSql}
        ORDER BY {similarityActualValue} DESC
        LIMIT @limit
        OFFSET @offset
    ";

    cmd.Parameters.AddWithValue("@embedding", target);
    cmd.Parameters.AddWithValue("@limit", limit);
    cmd.Parameters.AddWithValue("@offset", offset);

    foreach (KeyValuePair<string, object> kv in sqlUserValues)
    {
        cmd.Parameters.AddWithValue(kv.Key, kv.Value);
    }

In particular, the foreach loop adds sqlUserValues (including similarityPlaceholder = "@__min_similarity") to the query parameters, but this parameter does not appear to be used in the query.

Instead, records are filtered by similarity after the query execution:

kernel-memory/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs

Lines 458 to 469 in 3d34260

    
    var run = true;
    while (run && await dataReader.ReadAsync(cancellationToken).ConfigureAwait(false))
    {
        double similarity = dataReader.GetDouble(dataReader.GetOrdinal(similarityActualValue));
        if (similarity < minSimilarity)
        {
            run = false;
            continue;
        }

        result.Add((this.ReadEntry(dataReader, withEmbeddings), similarity));
    }

roldengarm · 2024-07-04T07:41:12Z

Thanks for checking, @marcominerva. I think you're right. Perhaps it worked on my end because minSimilarity was 0; I haven't checked. But yes, I guess the check would have to be flipped.

But wouldn't it be better and easier to read if the minSimilarity check were part of the query? Then Postgres could optimise it too, instead of filtering in memory afterwards. I would need to check filterSql a bit further, as in my test case it was empty / TRUE.

marcominerva · 2024-07-04T07:49:52Z

> But wouldn't it be better and easier to read if the minSimilarity check were part of the query? Then Postgres could optimise it too, instead of filtering in memory afterwards. I would need to check filterSql a bit further, as in my test case it was empty / TRUE.

Yes, that is exactly what I meant when I said that the minSimilarity parameter is currently not used in the query.

roldengarm · 2024-07-06T04:50:31Z

> > But wouldn't it be better and easier to read if the minSimilarity check were part of the query? Then Postgres could optimise it too, instead of filtering in memory afterwards. I would need to check filterSql a bit further, as in my test case it was empty / TRUE.
>
> Yes, that is exactly what I meant when I said that the minSimilarity parameter is currently not used in the query.

Hi @marcominerva, I've made the changes you recommended; the difference filter is now applied in the SQL query.
For some reason, I could not use "WHERE {colDifference} < {maxDifference}"; that caused an error saying that table 'km-default' cannot be found.

The query still performs well, but on my large dataset I'm now getting different results. I'll investigate further, as the results should obviously be similar. If you have a chance to review and spot any obvious mistakes, that would be great.
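Moving the threshold into the query relies on the relation between the two measures: the original query computed similarity as 1 minus the `<=>` distance, so a minimum similarity corresponds to a maximum distance of `1 - minSimilarity`. A minimal sketch of that relation follows. The class and method names are illustrative and not taken from the PR, and the sample values are chosen to be exactly representable as doubles so that the comparison is not affected by rounding.

```csharp
using System;

public static class Thresholds
{
    // similarity = 1 - distance, so "similarity >= minSimilarity"
    // is the same condition as "distance <= 1 - minSimilarity".
    public static double ToMaxDistance(double minSimilarity) => 1 - minSimilarity;

    public static bool PassesBySimilarity(double distance, double minSimilarity) =>
        1 - distance >= minSimilarity;

    public static bool PassesByDistance(double distance, double minSimilarity) =>
        distance <= ToMaxDistance(minSimilarity);

    public static void Main()
    {
        double minSimilarity = 0.5;
        foreach (double distance in new[] { 0.0, 0.25, 0.75, 1.0 })
        {
            // Both checks agree for every sample distance.
            Console.WriteLine(
                $"{distance}: {PassesBySimilarity(distance, minSimilarity)} / " +
                $"{PassesByDistance(distance, minSimilarity)}");
        }
    }
}
```

Whether the boundary is inclusive (`<=`) or strict (`<`) matters only for records sitting exactly on the threshold, but it is worth keeping consistent with the previous in-memory check.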

marcominerva · 2024-07-06T10:55:23Z

What you have called difference is actually the distance (the closer it is to 0, the closer the vectors are). I suggest renaming it in the code to make the intent clearer.

Moreover, Kernel Memory works with the concept of relevance (the closer it is to 1, the more relevant the vectors are). All memories return the most relevant records first. See for example:

kernel-memory/service/Core/Search/SearchClient.cs

Lines 227 to 238 in 3d34260

    
    IAsyncEnumerable<(MemoryRecord, double)> matches = this._memoryDb.GetSimilarListAsync(
        index: index,
        text: question,
        filters: filters,
        minRelevance: minRelevance,
        limit: this._config.MaxMatchesCount,
        withEmbeddings: false,
        cancellationToken: cancellationToken);

    // Memories are sorted by relevance, starting from the most relevant
    await foreach ((MemoryRecord memory, double relevance) in matches.ConfigureAwait(false))
    {

However, in your code the result list is now ordered by ascending distance (even though you calculate the similarity, the list is built by reading the records in the order the query returns them). So I think you need to reverse the result list.

roldengarm · 2024-07-08T00:01:09Z

> What you have called difference is actually the distance (the closer it is to 0, the closer the vectors are). I suggest renaming it in the code to make the intent clearer.

Thanks, I've changed it to distance.

> However, in your code the result list is now ordered by ascending distance (even though you calculate the similarity, the list is built by reading the records in the order the query returns them). So I think you need to reverse the result list.

I have already reversed the ORDER BY from DESC (on similarity) to ASC (on distance), so I think that's correct. Upon further investigation, it turns out the results were not different.
Can you please review it and see if it can be merged? @marcominerva
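The claim that ascending distance gives the same order as descending similarity can be checked with a small sketch. The record ids and distances below are made up, and the class and method names are illustrative rather than taken from the PR.

```csharp
using System;
using System.Linq;

public static class Ordering
{
    // Mirrors ORDER BY distance ASC.
    public static string[] ByDistanceAscending((string Id, double Distance)[] rows) =>
        rows.OrderBy(r => r.Distance).Select(r => r.Id).ToArray();

    // Mirrors ORDER BY (1 - distance) DESC.
    public static string[] BySimilarityDescending((string Id, double Distance)[] rows) =>
        rows.OrderByDescending(r => 1 - r.Distance).Select(r => r.Id).ToArray();

    public static void Main()
    {
        // Made-up cosine distances for four records.
        var rows = new (string Id, double Distance)[]
        {
            ("a", 0.75), ("b", 0.25), ("c", 0.5), ("d", 0.0),
        };

        string[] ascending = ByDistanceAscending(rows);
        Console.WriteLine(string.Join(",", ascending)); // d,b,c,a
        Console.WriteLine(ascending.SequenceEqual(BySimilarityDescending(rows))); // True
    }
}
```

Because the rows already arrive most-relevant-first, the reader loop only needs to convert each distance back to a relevance value (1 minus the distance) without reversing the list.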

roldengarm · 2024-07-08T00:01:28Z

@microsoft-github-policy-service agree [company="GenText"]

roldengarm · 2024-07-08T01:05:49Z

@microsoft-github-policy-service agree company="GenText"

marcominerva · 2024-07-08T07:43:11Z

> I have already reversed the ORDER BY from DESC (on similarity) to ASC (on distance), so I think that's correct. Upon further investigation, it turns out the results were not different. Can you please review it and see if it can be merged? @marcominerva

I have also tested the old and the new approach and verified that the results are now returned in the correct order. I would suggest only one last minor fix: use __ as a prefix for the colDistance column name, as in the original code:

    string colDistance = "__distance";

Also remember to correct the comment at line 437 (it still uses colDifference).

After that, the code seems OK to me. Now we must wait for approval from @dluc.

dluc · 2024-07-08T14:36:24Z

extensions/Postgres/Postgres/Internals/PostgresDbClient.cs

                }
 #pragma warning restore CA2100
-
+                this._log.LogTrace("SQL: {0}", cmd.CommandText);


Suggested change (remove the added line):

    this._log.LogTrace("SQL: {0}", cmd.CommandText);

roldengarm: Fixed, thanks @dluc

dluc · 2024-07-08T14:38:07Z

could you grant maintainers edit on the PR?

roldengarm · 2024-07-08T23:29:56Z

> could you grant maintainers edit on the PR?

Sorry @dluc, that option isn't there. I guess that's because the fork is owned by an organization rather than a user. Unless I'm missing something, according to the GitHub Docs the option should appear at the bottom of the right-hand sidebar.

dluc · 2024-07-09T06:59:55Z

extensions/Postgres/Postgres/Internals/PostgresDbClient.cs

        {
            filterSql = "TRUE";
        }
+        var maxDistance = 1 - minSimilarity;


Suggested change (add an empty line before the declaration):

    var maxDistance = 1 - minSimilarity;

Nit, code style: missing empty line.

dluc · 2024-07-09T11:34:38Z

> > could you grant maintainers edit on the PR?
>
> Sorry @dluc, that option isn't there. I guess that's because the fork is owned by an organization rather than a user. Unless I'm missing something, according to the GitHub Docs the option should appear at the bottom of the right-hand sidebar.

Please upvote https://github.com/orgs/community/discussions/5634

From follow-up PR #698 (fix: maxDistance cast to string):

## Motivation and Context (Why the change? What's the scenario?)

During string interpolation, if the dotnet culture is set to French, a comma is used as the decimal separator instead of a period, which caused the SQL error "unexpected ,".

## High level description (Approach, Design)

Fix regression introduced in #684, using SQL parameters instead of string interpolation.

---------

Co-authored-by: Konstantine Kalbazov <konstantine.kalbazov@veepee.com>
Co-authored-by: Devis Lucato <dluc@users.noreply.github.com>
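The culture problem described above can be reproduced without a database. C# string interpolation formats numbers with `CultureInfo.CurrentCulture`, so a comma-decimal culture changes the SQL text. The sketch below uses `string.Format` with an explicit culture to make that visible. The stand-in culture, the `embedding` column name, and the class and method names are assumptions made for illustration, and this is not the code from the fix, which per the description switched to SQL parameters.

```csharp
using System;
using System.Globalization;

public static class CultureDemo
{
    // Formats the predicate the way interpolation would under the given culture.
    public static string Predicate(double maxDistance, CultureInfo culture) =>
        string.Format(culture, "embedding <=> @embedding < {0}", maxDistance);

    // Stand-in for a comma-decimal culture such as fr-FR, cloned from the
    // invariant culture so the sketch does not depend on installed locale data.
    public static CultureInfo CommaCulture()
    {
        var culture = (CultureInfo)CultureInfo.InvariantCulture.Clone();
        culture.NumberFormat.NumberDecimalSeparator = ",";
        return culture;
    }

    public static void Main()
    {
        double maxDistance = 1 - 0.5;

        // Prints "embedding <=> @embedding < 0,5", which is not a valid SQL number.
        Console.WriteLine(Predicate(maxDistance, CommaCulture()));

        // Prints "embedding <=> @embedding < 0.5".
        Console.WriteLine(Predicate(maxDistance, CultureInfo.InvariantCulture));
    }
}
```

Passing the value as a command parameter (for example a `@maxDistance` parameter) avoids number formatting altogether, which is why it is more robust than forcing the invariant culture.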

From follow-up PR #1057 (Fix PostgresDbClient GetSimilarAsync minSimilarity requirement):

## Motivation and Context (Why the change? What's the scenario?)

Since #684, the PostgresDbClient no longer restricts results to those that meet the minSimilarity requirement when multiple filters are used. This is due to how the `WHERE` clause is prepared: `filter1 OR filter2 OR filter3 AND embedding <=> @Embedding < @maxDistance`, which cannot work as expected since the `AND` operator takes precedence over the `OR` operator.

## High level description (Approach, Design)

Simply add parentheses around the filters argument.
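The precedence problem is easy to see with plain booleans, because C# follows the same rule as SQL here: `&&` binds tighter than `||`, just as `AND` binds tighter than `OR`. The sketch below shows a record that matches one filter but is outside the distance limit, then builds a parenthesized clause. The `Build` helper and the class name are hypothetical and are not the code from the fix.

```csharp
using System;
using System.Collections.Generic;

public static class FilterClause
{
    // Joins the filters with OR and wraps them in parentheses
    // before AND-ing the distance check.
    public static string Build(IReadOnlyList<string> filters)
    {
        string filterSql = filters.Count == 0 ? "TRUE" : string.Join(" OR ", filters);
        return $"({filterSql}) AND embedding <=> @embedding < @maxDistance";
    }

    public static void Main()
    {
        // A record that matches filter1 but is too far from the query vector.
        bool filter1 = true, filter2 = false, filter3 = false, withinDistance = false;

        // Parsed as filter1 || filter2 || (filter3 && withinDistance).
        bool unparenthesized = filter1 || filter2 || filter3 && withinDistance;
        bool parenthesized = (filter1 || filter2 || filter3) && withinDistance;

        Console.WriteLine(unparenthesized); // True: the record slips through
        Console.WriteLine(parenthesized);   // False: the record is rejected
        Console.WriteLine(Build(new[] { "filter1", "filter2", "filter3" }));
    }
}
```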

test: remove 1 - calculation

a803d59

roldengarm requested a review from dluc as a code owner July 3, 2024 23:31

fix: use difference in sql query

f791fb1

chore: rename difference to distance

105abdf

dluc reviewed Jul 8, 2024

dluc added the "waiting for author" and "work in progress" labels Jul 8, 2024

Roland Oldengarm and others added 2 commits July 9, 2024 11:25

chore: revert logging and fix comment

9db5fc2

Merge branch 'main' into fix/postgresquery

14f68c7

dluc reviewed Jul 9, 2024

dluc approved these changes Jul 9, 2024

dluc merged commit 3515ab4 into microsoft:main Jul 9, 2024

dluc mentioned this pull request Jul 16, 2024

fix: maxDistance cast to string #698

Merged

mrocha51248 mentioned this pull request Apr 29, 2025

Fix PostgresDbClient GetSimilarAsync minSimilarity requirement #1057

Merged
