Two-way two-locus statistics (python proto)#3179
Two-way two-locus statistics (python proto)#3179jeromekelleher merged 1 commit intotskit-dev:mainfrom
Conversation
|
Since @apragsdale isn't in the tskit organization, pinging him here to add as reviewer. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3179 +/- ##
=======================================
Coverage 89.63% 89.63%
=======================================
Files 28 28
Lines 32004 32004
Branches 5879 5879
=======================================
Hits 28688 28688
Misses 1886 1886
Partials 1430 1430
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
b77bc1b to
9f8222e
Compare
apragsdale
left a comment
There was a problem hiding this comment.
Thanks @lkirk.
If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. Unfortunately, this does not apply to unbiased statistics, which I've validated manually.
This is ok (even expected) for the unbiased statistics, since the formulation assumes sampling without replacement, but having overlapping sample sets would essentially copy those repeated sample nodes. I don't think we should worry about it.
834f0c2 to
dc86340
Compare
apragsdale
left a comment
There was a problem hiding this comment.
Thanks for making that change. This looks good to me then.
jeromekelleher
left a comment
There was a problem hiding this comment.
I've had a look through and the code all LGTM. The changes seem sensible, but I'd need to spend a lot longer on it to really digest the details and understand the finer points.
I'm going to ping @petrelharp here for a review as he's much better at this sort of stuff
| check_set_indexes(num_sample_sets, tuple_size * num_index_tuples, index_tuples) | ||
|
|
||
|
|
||
| def ld_matrix( |
There was a problem hiding this comment.
this is the proposed API
|
Summary: this PR is
|
|
Input: "one-way versus two-way" is determined by whether or not there are indexes.
Output: full is a 3D array
The only question is whether the third dimension is dropped.
|
67e7f01 to
92418ba
Compare
This PR implements two-way LD statistics, specified between sample sets. During the development of this functionality, a number of issues with the designation of state_dims/result_dims were discovered. These have been resolved, testing clean for existing code and providing the proper behavior for this new code. The mechanism by which users will specify a multi-population (or two-way) statistic is by providing the `indexes` argument. This helps us avoid creating another `ld_matrix` method for the TreeSequence object. In other words, for a one-way statistic, a user would specify: ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2]) ``` Which would output a 3D array containing one LD matrix per sample set. ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=(0, 1)) ``` Will output a 2D array containing one LD matrix for the index pair. This would use our `D2_ij_summary_func`, instead of the `D2_summary_func`. Finally, ```python ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1), (1, 1)]) ``` will output a 3D array containing one LD matrix _per_ index pair provided. Tests have been added to validate the result dimension for all of the possible input combinations. Since these are two-way statistics, the indexes must be length 2. We plan on enabling users to implement k-way via a "general_stat" api. We did not implement anything more than two-way statistics here because of the combinatoric explosion of logic required for indexes > 2. I added some basic tests to demonstrate that things were working properly. If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. This test does not work on unbiased statistics unless the sample sets being tested are equal in index. I've added another test for two-way unbiased statistics. I've also cleaned up the docstrings a bit and fixed a bug with the D_prime statistic, which should not be weighted by haplotype frequency.
92418ba to
6274d10
Compare
jeromekelleher
left a comment
There was a problem hiding this comment.
I'm happy to merge this as a step in the right direction - are you ok to merge @petrelharp ?
|
Sure, I'm happy to merge, although to help me stay on top of what's next it'd be nice to open an issue with a checklist of what needs to happen. Two things for that list that I'm thinking about now are:
|
|
I created such an issue noting this. If everything looks ok, shall we merge and then tackle those points in an upcoming PR? |
This PR implements two-way LD statistics, specified between sample sets. During the development of this functionality, a number of issues with the designation of state_dims/result_dims were discovered. These have been resolved, testing clean for existing code and providing the proper behavior for this new code.
The mechanism by which users will specify a multi-population (or two-way) statistic is by providing the
indexargument. This helps us avoid creating anotherld_matrixmethod for the TreeSequence object.In other words, for a one-way statistic, a user would specify:
Which would output a 3D ndarray containing one LD matrix per sample set.
Which would output a 2D ndarray containing one LD matrix for the index pair. This would use our
D2_ij_summary_func, instead of theD2_summary_func. Finally, if a user providedWe would output a 3D ndarray containing one LD matrix per index pair provided.
Since these are two-way statistics, the indexes must be length 2. We plan on enabling users to implement k-way via a "general_stat" api. We did not implement anything more than two-way statistics here because of the combinatoric explosion of logic required for indexes > 2.
I added some basic tests to demonstrate that things were working properly. If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. Unfortunately, this does not apply to unbiased statistics, which I've validated manually.
I've also cleaned up the docstrings a bit and fixed a bug with the D_prime statistic, which should not be weighted by haplotype frequency.
Fixes #2832