ENH: Generate var_names from the data and partial predict#98
thequackdaddy wants to merge 6 commits into pydata:master
Codecov Report
@@ Coverage Diff @@
## master #98 +/- ##
==========================================
+ Coverage 98.96% 98.99% +0.03%
==========================================
Files 30 30
Lines 5585 5760 +175
Branches 775 803 +28
==========================================
+ Hits 5527 5702 +175
Misses 35 35
Partials 23 23
I went ahead and built the Here's a basic example:
In [1]: from patsy import dmatrix
...: import pandas as pd
...: import numpy as np
...:
...: data = pd.DataFrame({'categorical': ['a', 'b', 'c', 'b', 'a'],
...: 'integer': [1, 3, 7, 2, 1],
...: 'flt': [1.5, 0.0, 3.2, 4.2, 0.7]})
...: dm = dmatrix('categorical * np.log(integer) + bs(flt, df=3, degree=3)',
...: data)
...: dm.design_info.partial({'categorical': ['a', 'b', 'c']})
...:
Out[1]:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0.]])
In [2]: dm.design_info.partial({'categorical': ['a', 'b'],
...: 'integer': [1, 2, 3, 4]},
...: product=True)
Out[2]:
array([[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0.69314718, 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 1.09861229, 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 1.38629436, 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0.69314718, 0.69314718,
0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 1.09861229, 1.09861229,
0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 1.38629436, 1.38629436,
0. , 0. , 0. , 0. ]])
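The second call above returns 8 rows for 2 categories and 4 integers, which is consistent with evaluating every combination of the supplied levels. A minimal sketch of how such a grid could be assembled, assuming nothing about the PR's actual implementation (`product_grid` is a hypothetical helper, and `math.log` stands in for `np.log` from the formula):

```python
import itertools
import math


def product_grid(levels):
    """Return one dict per combination of the given factor levels.

    `levels` maps a column name to the values to evaluate. The first
    key varies slowest, which matches the row order in the output above.
    """
    names = list(levels)
    combos = itertools.product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical input mirroring the second call in the example.
grid = product_grid({'categorical': ['a', 'b'], 'integer': [1, 2, 3, 4]})

for row in grid:
    # The log column of the design matrix corresponds to log(integer).
    print(row['categorical'], row['integer'], round(math.log(row['integer']), 8))
```

Each dict in `grid` could then be treated as one row of prediction data; the sketch only covers building the combinations, not the design-matrix columns themselves.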
It seems like it would be simpler to query a
Even simpler (from the API design / coupling perspective) would be to pass in a dict-like that loads the data lazily on demand, like:
class LazyData(dict):
    def __missing__(self, key):
        try:
            return bcolz.load(key, file)
        except BcolzKeyNotFound:
            raise KeyError(key)
Would this work for you? Is the
This is also missing lots of tests, but let's not worry about that until after the high-level discussion...
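The lazy-loading idea above can be tried without bcolz. The following is a rough, self-contained sketch under stated assumptions: `LazyColumns`, its `loader` argument, and the in-memory `store` are all hypothetical stand-ins, and the loader is assumed to raise `KeyError` for unknown names.

```python
class LazyColumns(dict):
    """Dict that loads a column the first time it is requested."""

    def __init__(self, loader):
        super().__init__()
        self._loader = loader  # hypothetical callable: column name -> column
        self.loaded = []       # records which columns were actually fetched

    def __missing__(self, key):
        # Called by dict.__getitem__ only when `key` is absent. The loader
        # is expected to raise KeyError for names it does not know.
        value = self._loader(key)
        self[key] = value      # cache so each column is loaded only once
        self.loaded.append(key)
        return value


# Stand-in for an on-disk columnar store (assumption, not a bcolz API).
store = {'integer': [1, 3, 7], 'flt': [1.5, 0.0, 3.2], 'unused': [0, 0, 0]}
data = LazyColumns(store.__getitem__)

data['integer']
data['integer']  # served from the cache, no second load
print(data.loaded)
```

Note that `__missing__` is only triggered by subscript access; `key in data` and `data.get(key)` do not load anything, so this approach relies on the consumer looking columns up with `data[key]`.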
Hmm... I hadn't thought of that. That should be relatively easy to add/change based on what I've done so far. The heart of this is the
I presume you're implying that I shouldn't be worrying about the
This is really clever, thanks! I'll try it. However, I don't think it solves the
Yes.
Sounds good. Writing tests is not something I've excelled at. This is somewhat tested and I think there is coverage for most of the new lines; likely I missed a few. I added some
Hello,
I have a proposal that really came about because of the way I've been interacting with patsy.
My datasets are kind of long and kind of wide. I have lots of fields that I use for exploring stuff, but naturally they just don't work out.
I've been using bcolz because it stores the data in a columnar fashion, making horizontal slices really easy. Before, I'd been creating a list of variables that I wanted, defining all the transforms that I needed in patsy, and then feeding that through. I can't load the entire dataset into memory just because it's too wide and long, and I might only be looking at 20-30 columns for any one model.
So I propose having patsy attempt to figure out which columns it needs from the data using this new var_names method, which is available on DesignInfo, EvalFactor, and Term. In a nutshell, it gets a list of all the variables used, checks whether each variable is defined in the EvalEnvironment, and if not, assumes it must be data.
I've called this var_names for now, but arguably non_eval_var_names might be more accurate? Open to suggestions here.
One nice thing is that when using incr_dbuilder, it can automatically slice on the columns, which makes the construction much faster (for me at least).
Here's a gist demo'ing this.
https://gist.github.com/thequackdaddy/2e601afff4fbbfe42ed31a9b2925967d
Let me know what you think.
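The rule described in this proposal (collect every name a factor uses, drop those found in the evaluation environment, and treat the rest as data columns) can be illustrated with the standard `ast` module. This is only a sketch of the idea, not the PR's code: `data_var_names` is a hypothetical function, and a plain dict stands in for the EvalEnvironment.

```python
import ast
import builtins
import math


def data_var_names(code, env):
    """Names used in `code` that are found in neither `env` nor builtins."""
    tree = ast.parse(code, mode='eval')
    # ast.Name covers bare identifiers; attribute names such as the `log`
    # in `math.log` and keyword-argument names are not ast.Name nodes.
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return {n for n in names if n not in env and not hasattr(builtins, n)}


# A plain dict as a stand-in for the evaluation environment (assumption).
env = {'math': math}
print(sorted(data_var_names('math.log(integer) + flt', env)))
```

Anything left over is assumed to be a data column, which is what lets a caller request only those columns from a wide on-disk table.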