--------------------------------------
TODO
--------------------------------------
longterm "cleanup" roadmap
==========================
This is a short list of the most important shortcomings of the current
implementation that I am aware of.
- add support for chunking to all operations to be able to simulate datasets
of any size
- entirely encapsulate the indexing stuff in the data provider layer
and possibly use pytables indexing engine
- automatic missing value propagation
- use a different value for missing int columns (but keep -1 for link columns)
- all arguments to all functions should accept expressions automatically
(there should be a base class handling this)
- expose all numpy functionalities
- pure Python mode
- use proper logging & warnings (use logging and warnings modules)
- full test coverage, including unit tests (this presupposes pure Python mode)
features I would like to implement (in no particular order)
===========================================================
- use a different "random sequence" for each random number "user" (each
alignment with frac_need=uniform, explicit call to a random number generator
like uniform(), or logit_score/logit_regr), so that divergences of two
simulations with the same seed do not leak to unrelated parts of the model
through the position in the random sequence. Ideally, it should work
even if one model uses one more sequence than the other in the middle of
the simulation. That is, the sequence initial seed for one expression/user
should not be based on the position/order of the expression in the
simulation.
>>> dict based on the textual form of the expression?
for align/logit_score/logit_regr, it would probably work fine,
for uniform(), not so much because the "expression" is the same.
adding the sequence position in addition, would fix the "collision", but
would introduce the "unrelated" changes problem when inserting a random
"user" expression and this is exactly what we are trying to avoid in the
first place.
>>> an optional explicit sequence name would be the most reliable solution but it is
also the most verbose and annoying. It would not be that annoying if we
only require an additional argument *without* requiring the user to
declare the sequence beforehand.
>>> the best option might be to use the position by default but skip any with
an explicit name, so that we can set explicit names only for the
"inserted" expressions (which are not present in both variants)
- add an option to disable (sub)process timings, so that a diff between two
simulation logs is more meaningful
- implement aggregates with filter argument on array with 2+ dimensions
for grpmin, grpmax & grpsum, we could specify "ignore values" for each
function and dtype and do values[filter_value] = ignore_value.
This might be slower than nanmin/nanmax but at least it would work.
For grpstd, it would be trickier but possible (ignore value is the mean)
but for median & percentile that method would simply not work.
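The "ignore values" idea above can be sketched with plain numpy. The names here are illustrative, not the actual grpmin/grpmax/grpsum implementation: each aggregate gets the neutral element of its operation as ignore value.

```python
import numpy as np

# "Ignore values" per aggregate: the identity element of the operation,
# so filtered-out cells cannot affect the result.
IGNORE = {'sum': 0.0, 'min': np.inf, 'max': -np.inf}

def filtered_agg(values, filter_mask, op):
    # Copy (as float, so inf is representable), then overwrite the
    # filtered-out cells with the operation's neutral element.
    v = values.astype(float, copy=True)
    v[~filter_mask] = IGNORE[op]
    return getattr(np, op)(v, axis=-1)

a = np.array([[1.0, 5.0, 3.0],
              [4.0, 2.0, 6.0]])
keep = np.array([[True, False, True],
                 [True, True, False]])
filtered_agg(a, keep, 'sum')   # per-row sum of kept cells only
filtered_agg(a, keep, 'min')   # per-row min of kept cells only
```

As noted above, this is likely slower than nanmin/nanmax but works for any number of dimensions; it cannot work for median/percentile since those have no neutral element.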
- use putmask & put everywhere we do a[bool_array] = x and a[int_array] = x
as they seem to be universally (a bit) faster
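The two equivalences in question, for reference (the arrays are made up):

```python
import numpy as np

a = np.zeros(6)
mask = np.array([True, False, True, False, False, True])
# a[mask] = 9 and np.putmask(a, mask, 9) are equivalent;
# putmask skips the fancy-indexing machinery, hence the small speedup.
np.putmask(a, mask, 9)

b = np.zeros(6)
idx = np.array([1, 4])
# b[idx] = 7 and np.put(b, idx, 7) are likewise equivalent.
np.put(b, idx, 7)
```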
- fiddle some more with the AST to support this syntax: 18 <= age <= 90
- implement a // b (as a cleaner and faster equivalent to trunc(a / b))
- implement some kind of global namespace, where we can put anything not
tied to an entity (eg, all scalar values, groupby results, ...), so that
we can access it from another entity. For example, we could use in
household a scalar (or groupby) value stored in a (temporary) global by
a person process
- make examples work out-of-the-box on Linux:
use "/" in the examples and use os.path.normpath(path)?
- try to create output directory if it does not exist
- allow loading datasets inline, because I want align & select_link to be
consistent, and having to import all alignment files would be a pain
load('al_p_dead_m.csv', float, cached=True)
MIG: load('param\mig.csv', type=float)
othertable: load('param\othertable.csv', fields=[('INTFIELD', int),
('FLOATFIELD', float)])
periodic: load('param\globals.csv', fields='auto', transposed=True)
periodic: load('param\globals.csv', transposed=True, fields=[
(CPI, float),
(FWEMRA, float),
...
(ERA_R, float)])
# note that for periodic globals, I am unsure we need to be able to load
# them inline *but* being able to load them at run time instead of at
# import time would be great
# we need the cached=True option to specify whether the result
# can be cached or should be (re)loaded each period: if used to exchange
# data with another program, it needs to be loaded each period. But for some
# cases (most notably alignment), the same data can be kept in-memory.
# IDEA: for "constant" arrays, we need a way to specify which temporary
# variables should not be dropped at the end of the period so that we can
# do the load in init
# >>> in that case, declaring an explicit dataset would probably be simpler
- check/prevent duplicate ids during import
- switch to py3 syntax for prints everywhere
- add command to console to list macros
- make "othertable" indexable by another column than PERIOD
- make "variable checking" (collect_variable) prefix variable by their
entity and globals by their "table"
- merge code for loading ndarrays and tables (ie importer.load_ndarray and
importer.CSV()); they are both special cases of nd record arrays, which I
would like to support at some point. For that I will need:
- support loading nd arrays from csv files formatted like "tables" (one
column per field). The only difference is that the last dimension is
not "expanded" horizontally.
- support nd arrays of "record" arrays (if using the "table" format, this is
only a matter of "folding" the non-record "dimension"/columns, ie checking
that all combinations of values are present -- because I don't want to
support sparse arrays yet -- and that they are correctly sorted -- else I
should sort them)
- support loading "tables" with the last "dimension" expanded horizontally
this means the "data" in those expanded columns is not named explicitly
(at least with the current format)
see person_salary.csv. This implies that the last column is a "dimension"
(ie, all rows have values for all the possible values for that field --
it can be the "missing" value though)
- somehow prevent running two simulations on the same output file at the same
time. I have a report that it causes all sorts of strange problems but I
could not reproduce it here (on my very small test case).
- allow "import"ing globals during the simulation or, at a minimum, in the
declaration of the globals in the simulation file (effectively allowing to
bypass the import step for time series and other csv files)
- change dtype method on expr to return (shape, dtype) or something like that
the immediate goal is to be able to distinguish between arrays and scalars
in advance in methods like Variable.__getitem__
- when trying to "import" a simulation file or trying to "run" an import file,
output an informative error
- check versions of dependencies on startup, instead of letting it crash
randomly in the middle of a simulation with an error message difficult
to understand for users (for example with numexpr 1.4.2)
- when there is a divide by zero or 0/0, numpy outputs a RuntimeWarning
which seems to confuse users (especially 0/0 which produces a cryptic
warning: RuntimeWarning: invalid value encountered in double_scalars).
This can happen at least in a groupby() call with percent=True when the
total is 0.0 (for example if expr = grpcount() and there is no individual
matching the filter - see example in the test simulation)
The situation in release 0.5.1 was even messier because of the code of
avglink which disabled the warning in the middle of the simulation and
did not restore the behaviour afterwards, which means that if we had
an avglink before an error, we did not get any warning, but we did get some
if we did not use any avglink or avglink was after the error.
I think the ideal solution here would be to let the user configure what he
wants (raise/warn/ignore) but show him the line number in his code, though
that will be hard to achieve. Easier to achieve would be to print the expr
where it happened. It should be possible to do this by using np.seterrcall:
>>> old_handler = np.seterrcall(my_handler)
>>> old_err = np.seterr(all='call')
and inspect the caller frame at that point (using sys._getframe([depth])).
it would be cleaner to use np.seterr(all='raise') and catch the
exception in Expr.evaluate, possibly simply using add_context. But that
would only work for the "raise" behaviour which is not what all users want,
since divide by 0 and 0/0 can be normal/expected result.
for the "warn" or "message" case, I think the simplest option is to use
the caller frame hack. Another option would be to use seterr("raise") at
first but in the exception handler (at expr level), use seterr("ignore")
and re-evaluate the expression. Hmmm, that would only work once per class
of exception (invalid, overflow, ...), not once per exception location,
as warnings do (and this is the behaviour we would like).
yet another option is to call np.seterrcall(my_handler) before any expression
is evaluated by numpy. This might be slow though. We would use something
like:
    def make_err_handler(expr):
        msg = str(expr)
        def handler(err_type, flag):  # the np.seterrcall handler signature
            raise/print/... msg
        return handler

    def evaluate(self):
        np.seterrcall(make_err_handler(self))
        # [do real stuff here]
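A runnable sketch of that last option (the expression string is illustrative, standing in for str(expr) of the expression being evaluated):

```python
import numpy as np

errors = []

def make_err_handler(expr_str):
    # A per-expression handler lets the message show which expression
    # triggered the floating-point warning.
    def handler(err_type, flag):
        # np.seterrcall handlers receive the error kind ('divide',
        # 'invalid', ...) and an integer status flag.
        errors.append('%s encountered in: %s' % (err_type, expr_str))
    return handler

old_handler = np.seterrcall(make_err_handler('grpcount(filter) / total'))
old_err = np.seterr(all='call')
try:
    np.array([0.0]) / np.array([0.0])   # 0/0 triggers 'invalid'
finally:
    # restore previous behaviour (unlike the avglink bug described above)
    np.seterr(**old_err)
    np.seterrcall(old_handler)
```

Restoring the old settings in a finally block is exactly what the 0.5.1 avglink code failed to do.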
- make an installer with shortcuts, etc...
even though cx_freeze can produce basic installers and works fine to produce
the executable, the installers it produces are not sufficient in our case
since we need a shortcut to notepad, not to the liam2 simulation engine
I guess I'll simply use NSIS
- improve examples:
* comment each new stuff as it appears the first time
? kill logit_regr without expr (logit_regr(0.0, align='al_p_dead_m.csv')):
use align(uniform(), 'al_p_dead_m.csv') instead?
* kill children link, use it or at least add a comment about it
(it is declared but not used in any example)
- implement all() and any()
- refactor/redesign the clunky parts
* the goals are:
- to make things testable by real unittests
- to have better separations of concepts / clearer "scopes" for each
class
* all entities in context
* hide id_to_rownum
* .array vs .table vs .lagged_array vs Entity vs EntityContext
vs full_context vs mapping passed to numexpr
.array doesn't need to be entirely in memory, .table
does not need to be on disk, but we need separate concepts in some places
because they need to support different functionalities: enlarge, remove,
add column, delete column for .array and only .append for .table
- use numexpr.E and precompile expressions as soon as possible, instead of
passing strings at evaluation time.
The problem is that numexpr.E always creates nodes of kind "double".
We could use a more generic NodeMaker:
    class NodeMaker(object):
        def __init__(self, kind=None):
            self._kind = kind

        def __getattr__(self, name):
            if name.startswith('_'):
                return self.__dict__[name]
            else:
                return ne.expressions.VariableNode(name, self._kind)

    INTNODE = NodeMaker('int')
though, I think always using VariableNode directly would be simpler
In either case, if we want to be able to compile expressions earlier than
at evaluation, we will need to know the type of each variable.
For that to happen, there are a few problems:
* in Entity.variables, we create temporary variables without types. We need
the variables to exist to be able to parse the expressions.
We could either:
1) pre-parse expressions (using compile) to know which variables each
expression depend on, make a topological sort on that and parse
expressions in the correct order. This way, the types of all variables of
an expression should be known by the time it is parsed.
2) parse all expressions using untyped temporary variables (like we do
now), get all variables for each expression (using collect_variables),
topological sort all expressions which are stored in a temporary
variable and "type" them one by one.
Option 1 seems cleaner and will probably be faster but I wonder if it
will be able to handle my "conditional context" stuff (not sure option 2
can either).
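Option 1 could be sketched like this, using compile() to extract each expression's dependencies (the expressions and variable names are made up; cycles are not detected in this sketch):

```python
exprs = {
    'c': 'a * 2',
    'b': 'a + c',
    'd': 'b + c',
}

def names_used(expr):
    # co_names lists the global names an expression refers to,
    # without evaluating it.
    return set(compile(expr, '<expr>', 'eval').co_names)

def topo_order(exprs):
    # Depth-first topological sort: visit dependencies before the
    # variable that uses them, so types are known when parsing.
    order, seen = [], set()
    def visit(var):
        if var in seen or var not in exprs:
            return
        seen.add(var)
        for dep in names_used(exprs[var]):
            visit(dep)
        order.append(var)
    for var in exprs:
        visit(var)
    return order

topo_order(exprs)   # 'c' comes before 'b', 'b' before 'd'
```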
- rewrite the whole import process: load boolean fields as int8 (this
shouldn't need more work/conversion), then do an n-way merge, then do
interpolation.
If we haven't done the bool->int8 migration for the simulation code then
we will also need to convert boolean fields back to booleans.
In that case, we will have to copy the data using: b = i8 == 1, otherwise
"missing" (-1) evaluates as True, which is not what we want in most cases
(and for backward compatibility)
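The back-conversion pitfall can be shown in two lines of numpy:

```python
import numpy as np

i8 = np.array([1, 0, -1, 1], dtype=np.int8)   # -1 encodes "missing"

# Naive cast: -1 is truthy, so missing silently becomes True.
wrong = i8.astype(bool)

# Explicit comparison keeps missing (-1) out of the True values.
b = i8 == 1
```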
- "compile" expressions to another representation which:
* has a uniform interface, so that it can be walked/traversed
recursively without having to define a traverse method on most
expression classes.
* does not have its __eq__ method overloaded, so that I can use "expr in d"
(dict.__contains__(k) calls k.__eq__) and store expressions in a dict
to check whether they occur more than once
Questions:
* use ASTNode class in numexpr as is?
* would a uniform interface make things cleaner for current
functionalities (mainly as_string, collect_variables and simplify)?
- somehow work around numexpr 2.x limitation of 31 variables (create hidden
temporary variables?)
- better document csv format: -- for missing
- launch external command during simulation (for example external model in
another program). That should be easy. However, to be useful, we need
to be able to load back the result in the middle of a simulation.
- try storing columns individually instead of in a table (as is done currently)
? compressing main "table" (ie the columns) with carray:
ac[indice_vector] is 1000x slower than a[indice_vector] !
ac[bool_vector] is 4.5x slower than a[bool_vector]
ac.sum() is 2.8x slower than np.sum(a) (np.sum(ac) doesn't work)
bincount(ac) is 100x slower than bincount(a)
ca.eval(expr, out_flavor='numpy') is 2x slower than ne.evaluate(expr)
ca.eval(expr) is 6x slower than ne.evaluate(expr)
- chunking to simulate any size of data:
x alignment will be hard to do right *when there is ranking involved*
because if we simply apply alignment to each chunk:
* the individual paths would potentially be much worse, because each
chunk would take its x% best-scoring individuals, *but* the best of one
chunk could very well score worse than the worst of another chunk, and
thus would not have been selected if the whole population was simulated
at once.
* the total number would be slightly less precise because we would round
to integers in each chunk instead of for the whole population. This can
be mitigated/corrected by storing the error in each chunk and adapting
the target in the next chunk
* if there is no deterministic ranking, it should be as good as a single
block simulation
* if the ranking score has only a small set of possible values,
we could partition individuals using their values, eg with integer:
partition 1: values 0-10
partition 2: values 11-20
etc...
This should allow treating/sorting chunks separately without having to
reconstruct the complete array in memory (see mapreduce sort).
However, I don't think this is the case here, so I'm stuck...
x matching is problematic
x new is problematic as it breaks the implicit sort by id of the table
> I have to "block" other processes during the new. The new itself can
> be done progressively as it doesn't modify any variable
x remove seems to be ok
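The error-carrying correction for per-chunk rounding mentioned above could look like this (a sketch; the chunk sizes and target proportion are made up):

```python
def chunked_targets(chunk_sizes, proportion):
    # Round the per-chunk target to an integer, but carry the rounding
    # error into the next chunk so the grand total stays on target.
    targets, error = [], 0.0
    for n in chunk_sizes:
        want = n * proportion + error
        take = int(round(want))
        error = want - take
        targets.append(take)
    return targets

# Three chunks of 1000 individuals, 3.05% need: per-chunk rounding alone
# loses half an individual per chunk; carrying the error does not.
chunked_targets([1000, 1000, 1000], 0.0305)
```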
? data import from within the console?/inside a simulation
? use more python-like syntax: use = instead of :
- review field def: do input fields need to be declared at all?
I could collect variables from all expressions and check that they are
all present in the input file.
however, some information about input fields will need to be defined anyway:
enumerations, ranges, default values, ... (unless I store all that in the
input file metadata)
- review the whole filter concept. The missing "else" value is often a problem,
but can we avoid it? do we want to avoid it?
- review align() function/alignment process (I don't want to code anything just
yet, I just want to know if the current syntax will need to be modified).
* Q: should it return False for all the persons which are not selected,
instead of the current weird situation: False if stored in a temp
variable, otherwise don't modify those which are not selected?
A: yes!!
* Q: what would the syntax be for multi-value alignment (when there are more
than two possible outcomes): align([2, 3, 4], ... ?
A: see al_p_multi_value.csv
* Q: so, I have a format in which the user can express the needed
proportions and I can easily compute the actual (integer) number of
individuals needed for each value. Now, how does the user specify
a "hint" of who should get which value? and how do I implement that?
A: one way to do that (and that is what is apparently done for
multinomial logistic regressions), is to have the user provide one
score expression for each possible outcome value.
This assumes all score expressions use the same "scale", otherwise some
outcomes will not get the individuals with the highest score for that
outcome
    outcome = np.empty(ctx_length)
    outcome.fill(missing_value)
    for outcome_value, score_values in zip(...):
        num_taken = 0
        sorted_indices = score_values.argsort()
        for idx in sorted_indices:
            if score_values[idx] > other_score[other outcomes]:
                outcome[idx] = outcome_value
                num_taken += 1
            elif score_values[idx] == other_score[other outcomes]:
                # break ties randomly
                outcome[idx] = choice(outcomes[score == score_values[idx]])
                num_taken += 1
            if num_taken == target_num:
                break
    # optimization: delete individuals that were taken for this outcome,
    # so that other outcomes don't have to compare with it
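A simpler, hedged variant (the names and the greedy strategy are illustrative, not the tie-breaking algorithm sketched above): give each outcome, in turn, its top-scoring still-unassigned individuals.

```python
import numpy as np

def align_multi(scores, targets, missing=-1):
    # scores: (n_outcomes, n_individuals); targets: wanted count per
    # outcome. Greedy sketch: outcomes are processed in order, so
    # fairness between outcomes and random tie-breaking are ignored.
    n = scores.shape[1]
    outcome = np.full(n, missing)
    free = np.ones(n, dtype=bool)
    for value, (row, need) in enumerate(zip(scores, targets)):
        # already-assigned individuals get -inf so they sort last
        order = np.argsort(-np.where(free, row, -np.inf))
        chosen = order[:need]
        chosen = chosen[free[chosen]]
        outcome[chosen] = value
        free[chosen] = False
    return outcome

scores = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.3, 0.7, 0.6, 0.5]])
align_multi(scores, [1, 2])   # -> [0, 1, 1, -1]
```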
- implement many2many
* transparently create extra array and table with 3 fields:
- period
- side1_id
- side2_id
* store a "pointer" to the array into the Link instance
* store a "pointer" to the table in ???
* adapt all "link" functions
* how do we set those relationships?
via actions?
- add_child: append(children, child_id)
- remove_child: remove(children, child_id)
or via methods on the link (I think I prefer this):
- add_child: children.append(child_id)
- remove_child: children.remove(child_id)
- propagate missing values for int columns
using floats everywhere should work (even for "id" and "link" columns); it
would only limit the number of individuals per entity to 2^24 (~16 million)
for float32 or 2^53 (~9 * 10^15) for float64
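Those limits are where consecutive integers stop being exactly representable, which is what would make float-typed ids collide:

```python
import numpy as np

f32_limit = np.float32(2**24)
f64_limit = np.float64(2**53)

# Past the limit, n and n + 1 round to the same float: two distinct
# ids would collide.
f32_collision = np.float32(2**24 + 1) == f32_limit
f64_collision = np.float64(2**53 + 1) == f64_limit

# Below the limit, every integer id is still distinct.
f32_ok = np.float32(2**24 - 1) != np.float32(2**24)
```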
- propagate missing values for boolean columns (see test_missing.py)
one option would be to use a boolean mask (and possibly masked arrays in
numpy) internally/for computations, but use special values in the output
(because that's much more readable)
if using floats / NaN:
----------------------
a & b -> a * b
a | b -> (a + b) > 0 # Wrong: > 0 loses the NaNs and NaN > 0 is False, not NaN
-> where(a1 + a2 == 2, 1, a1 + a2)
~a -> 1 - a
where(a, b, c)
-> a * b + (1 - a) * c # Wrong: that leaks NaNs in "unused branch"
-> where(a == NaN,
NaN,
where(a == 1.0, b, c)) # Wrong: can't compare to Nan
-> where(isnan(a), NaN, where(a == 1.0, b, c))
-> where(a != a, a, where(a == 1.0, b, c))
or
-> where(a == 1.0, b, where(a == 0.0, c, a))
or
-> add a new opcode nanwhere_dddd in numexpr:
nanwhere(a, b, c) => a != a ? a : (a == 1.0 ? b : c)
if using int / -1:
------------------
a & b -> where((a == -1) | (b == -1), -1, a * b)
a & b & c -> where((where((a == -1) | (b == -1), -1, a * b) == -1) | (c == -1),
-1,
where((a == -1) | (b == -1), -1, a * b) * c)
-> where((a == -1) | (b == -1) | (c == -1), -1, a * b * c)
a | b -> where((a == -1) | (b == -1), -1, a + b > 0)
a | b | c -> where((a == -1) | (b == -1) | (c == -1), -1, a + b + c > 0)
~ a -> where(a == -1, -1, 1 - a)
where(a, b, c) -> where(a == -1, -1, where(a == 1, b, c))
bool_expr * float_expr -> where(bool_expr == -1, nan, bool_expr * float_expr)
where(bool_expr, float_expr1, float_expr2)
-> where(bool_expr == -1, -1, where(bool_expr, float_expr1, float_expr2))
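The int/-1 rules above, written as plain numpy helpers (a sketch; the real implementation would rewrite expressions and go through numexpr rather than call Python functions):

```python
import numpy as np

MISSING = -1

def and_(a, b):
    # missing if either operand is missing, else logical and (as 0/1)
    return np.where((a == MISSING) | (b == MISSING), MISSING, a * b)

def or_(a, b):
    # missing if either operand is missing, else logical or (as 0/1)
    return np.where((a == MISSING) | (b == MISSING), MISSING,
                    ((a + b) > 0).astype(a.dtype))

def not_(a):
    # missing stays missing, else 1 - a flips 0 <-> 1
    return np.where(a == MISSING, MISSING, 1 - a)

a = np.array([1, 0, -1, 1])
b = np.array([1, 1, 1, -1])
```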
- delete "empty" households (dead persons)
OR
- implement cascading deletes/death on links
AND/OR
- some kind of integrity checks
- implement something like "by x, y, z" in stata, to do one groupby per
possible value of the expressions. eg "by gender" does one groupby where
gender is False, then one where gender is True.
http://www.ats.ucla.edu/stat/stata/faq/bys.htm
- "compress" in stata tries to autodetect the type of each field by
inspecting the data, and convert it to the smallest type which can handle
that data.
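A stata-like "compress" for integer columns could be sketched with np.min_scalar_type (floats are left untouched in this sketch):

```python
import numpy as np

def compress(column):
    # Find the smallest integer dtype able to hold both the min and the
    # max of the column, then downcast. Non-signed-int columns are
    # returned unchanged here.
    if column.dtype.kind != 'i':
        return column
    smallest = np.promote_types(np.min_scalar_type(int(column.min())),
                                np.min_scalar_type(int(column.max())))
    return column.astype(smallest)

big = np.array([0, 120, 7], dtype=np.int64)
compress(big)            # fits in uint8
compress(np.array([-1, 200], dtype=np.int64))   # needs int16 (signed + >127)
```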
- integrate more stuff into the console:
* import
* merge
* drop
- compression should be user-configurable for the simulation output file
- allow input fields to not be in the output
* it might be a good idea to rework the way to declare fields:
input:
- earny: float
- wstatus: int
- wzas: int
- yob: int
output:
- cscar: int
- csw: float
- cswl1: float
- round(X) should return int
- warning for unused fields
- better error message when we try to set an unknown field in new()
- better error message when the input file does not contain any table for an
entity.
tables.exceptions.NoSuchNodeError: group ``/entities`` does not have a
child named ``test``
- better error message when a procedure's processes are missing ordering
(it is a dict, instead of a list of dict)
>> we need to remove the {'predictor': xxx, 'expr': yyy} syntax before
>> we can do that
- better error message if unknown entity in simulation.processes (KeyError)
>> almost done for "init"
- better error message on invalid (unknown) process name in agespine
- better error message when an extra comma is used (see test.test_extra_comma)
- better error message when a temporary variable is used before it is computed
(the error is not caught by the parser since the variable exists)
KeyError: '(~to_give_birth)\nto_give_birth'
- Error messages containing utf8 chars are not displayed in notepad++ or
dos prompt consoles (py2exe is irrelevant, it just removes the intermediate
lines of code from the traceback but doesn't change the last lines).
Example: syntax error if there is a comment containing non-ascii chars inside
an expression.
- watches in step by step mode (which are recomputed/displayed automatically
after each step)
- macros using other macros. This will need:
1) pre-parse macros to compute which macro depend on which other macro
2) do a topological sort on them
3) parse them in order
- labeled enumerations
eg workstate: 1: "unemployed", 2: "employee", 3: "civserv", ...
* display the labels in show/groupby instead of the numeric values
? use the labels in "read" expressions (instead of using macros)
macros contain the == and can contain < or >, which is more generic,
so I'm not sure it would be an improvement
* use labels when setting new values (this could be done with macros, but
that's not how they use it now)
eg: workstate: "if(RETIREMENTAGE, 10, workstate)"
workstate: "if(RETIREMENTAGE, RETIRED, workstate)"
RETIREMENTAGE: "((gender) and (age >= 65)) or ((not gender) and (age >= WEMRA))"
RETIREMENTAGE: "age >= if(gender, 65, WEMRA)"
civilstate: {1: 'single', 2: 'married', 3: 'divorced', 4: 'widow'}
civilstate: Enum(single=1, married=2, divorced=3, widow=4)
# this is shorter but is error-prone if we use external data for that
# field (which is usually the case)
civilstate: Enum('single', 'married', 'divorced', 'widow')
ISSINGLE: civilstate.is('single')
ISSINGLE: civilstate == civilstate.single
- numpy methods (min/max/...) on zero-length array
- implement "log" action = show in a file
- lastpresent(col[, condition]) -> returns the last value which is not missing
the problem is that if even a single individual never had any value for that
column, it will go back all the way to the first period
- initialdata: s/false/value
- allow setting/creating temporary variables in interactive console
- make globals writeable
- line numbers in error messages.
http://stackoverflow.com/questions/13319067/parsing-yaml-return-with-line-number
- one filter, several actions.
eg, when someone dies, you want to change more than one field, and it
is a bit tedious to need to repeat the filter
same comment for divorce
- add a concept of "per individual" constant (ie not saved each period)
(eg errsal)
- expand variable definition: min, max values, default value, enum, ...
- new individuals should be able to:
* use variables' default values
- explicit function to check if a value is present:
where(ispresent(err), err, normal(0, 1))
we can't use != NaN because that wouldn't work for int and bool
or even floats because NaN != NaN!
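A sketch of ispresent() dispatching on dtype (the missing markers follow the conventions used elsewhere in this file: NaN for floats, -1 for int and link columns, no marker for bool):

```python
import numpy as np

def ispresent(values):
    # "Present" means "not the missing marker" for the column's dtype.
    kind = values.dtype.kind
    if kind == 'f':
        return ~np.isnan(values)      # NaN != NaN, so use isnan
    elif kind == 'i':
        return values != -1           # -1 is the int missing marker
    else:
        # bool (and anything else) has no missing marker here
        return np.ones(len(values), dtype=bool)

err = np.array([0.5, np.nan, 1.2])
np.where(ispresent(err), err, 0.0)    # replace missing with a default
```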
- infer expressions types systematically so that we can do more clever stuff:
some type checking, correct type for missing values when using filter, and
missing values handling in general.
- store some metadata in .h5 (list of entities, start_period, links, ...)
- implement other historic functions (with missing values):
* duration limited in time: eg: duration(xxx, 5): max 5 periods
* durationcurr(variable): duration of the current value of the variable
* tavg with a filter: only count periods where it satisfies the filter (eg > 0)
* currperiodstart(variable): returns the time period the current value
of the variable started
- test/integrate sumatra to launch simulations
- add "limit" argument to dump()
- prevent writing too many rows through show(dump())
bug fixes
=========
- if an entity is only used through links (no process is run directly on that
entity), the data for that entity is not loaded!
- nan -> 0 for sum in groupby (see numpy Ticket #1123)
https://github.com/numpy/numpy/issues/1721
- fix the "contextual filter" mess: it is taken into account for grpgini &
grpsum but not grpavg. The problem is that implementing it for grpavg would
break difficult_match for example.
I think the best way out of this is to ignore the contextual filter in all
aggregate functions. But in that case, what about ctx filter in new(),
matching(), ...?
- fix test data ("small"): all men with 18 <= age <= 90 are married, making
the matching test pointless in the first period
optimisation
============
- try to compress data using google snappy, quicklz or lz4 (they seem even
faster than fastLZ on which blosc is based)
- improve numexpr caching algorithm (LRU), and/or increase its size and/or
cache expressions ourselves. Some long expressions take a lot longer to
"compile" than to compute on small arrays. Eg minr & friends.
- store our indices in hdf file (id_to_rownum) so that we wouldn't have to
compute them for each simulation.
OR
use pytables built-in indexing engine (pytables 2.3 and later)
- optimise memory allocation by building a dependency graph between variables
and:
* free temporary variables as soon as they are not needed anymore
* reorder variables computation (respecting dependencies) to minimize the
number of temporary variables needed at any one point. See what numexpr
does to not duplicate effort.
> that might make logging & debugging harder, so it should be an option
- optimise "normal" expressions by collecting variables, and replacing them
with their definitions when they are temporary variables, unless that
particular variable is used more than N times. (testing should help us
determine what value to use for N)
-> this can only be done for variables which do not change between the place
where they are defined and the one where they are used
-> watch out for random functions !
- optimise expressions by factoring out common sub expressions (which are used
more than N times) in temporary variables
- optimise tavg by storing partial sums in a hidden field, and computing
the mean like this: (sum_last_period + period_value) / num_periods
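The running-sum idea in isolation (a sketch; in practice the sum would live in a hidden per-individual field, not a scalar):

```python
def tavg_step(prev_sum, num_periods, period_value):
    # Only the running sum is kept between periods; the mean is derived
    # from it instead of re-reading the whole history each period.
    new_sum = prev_sum + period_value
    return new_sum, new_sum / num_periods

s, avg = 0.0, 0.0
for t, value in enumerate([10.0, 20.0, 60.0], start=1):
    s, avg = tavg_step(s, t, value)
# avg is now (10 + 20 + 60) / 3
```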
- optimise duration by using duration for last period if present
- do not repartition data for each alignment and each period, but keep the
groups from period to period and update them when the linked variable
changes. Or, implement a weaker form of this: cache only within a
period, so that only the first alignment with a particular combination of
variables would do the partitioning. eg in MIDAS alignments currently
(10/12/2012) use only very few different columns:
* period: 10x
* age, period: 3x (birth, dead_f, dead_m)
* agegroup_civilstate, period: 2x (divorce, separate)
* agegroup_civilstate, civilstate: 2x (mmkt_f, mmkt_m)
* agegroup_work, period: ~22x !!! (all the others)
furthermore agegroup_work is computed at the start of the period, and
not modified afterwards
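the weaker, within-period form could be sketched like this (all names hypothetical): group indices are cached per column combination, so only the first alignment using e.g. ('agegroup_work', 'period') pays for the partitioning:

```python
from collections import defaultdict

# hypothetical sketch: cache the partitioning per combination of
# columns; the cache would be cleared at the start of each period
partition_cache = {}

def partition(rows, columns):
    key = tuple(columns)
    if key not in partition_cache:
        groups = defaultdict(list)
        for rownum, row in enumerate(rows):
            groups[tuple(row[c] for c in columns)].append(rownum)
        partition_cache[key] = dict(groups)
    return partition_cache[key]

rows = [{'age': 30, 'period': 2012},
        {'age': 30, 'period': 2012},
        {'age': 40, 'period': 2012}]
print(partition(rows, ['age', 'period']))
# {(30, 2012): [0, 1], (40, 2012): [2]}
```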
- translate more performance-critical code to C/Cython
? if links are accessed more often than they are modified, it could make sense
to store row indices in memory (and possibly on disk?), instead of ids.
It would make both end-of-period data storage (would have to translate back
to ids) and lifecycle functions (new and remove) slower.
? try alternate algorithm for RemoveIndividuals: simply reindex, like I do
  elsewhere. It might cause a problem if the last ids (newborns) die.
it seems to be a tad slower in fact. However, it might be a better starting
point to translate in C if that is ever necessary.
? make numexpr able to compute several expressions at the same time
eg ne.evaluate(["a + (a * b)", "(a * b) * 5"], {'a': [1, 2], 'b': [3, 4]})
>> [4, 10], [15, 40]
>> this would be limited to expressions which do not "recompute" any of
>> the variables used
>> if this is the only factorization mechanism, it would miss some cases:
>> if a sub expression is shared between one expression inside the group
>> and one outside it, it wouldn't be factored out
Ideally, I could feed targets too, so that numexpr would be able to do
the whole thing by itself:
eg ne.evaluate([('c', 'a + (a * b)'),
('d', '(a * b) * c'),
('a', 'd + 1'),
('c', 'a * 2')], {'a': [1, 2], 'b': [3, 4]})
>> {'a': ..., 'c': ..., 'd': ...}
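the targets-fed interface above could be sketched with plain Python/numpy instead of numexpr (`evaluate_batch` is hypothetical): each target is written back into the environment so later expressions can reuse it:

```python
import numpy as np

# sketch of the proposed batch interface: a list of (target, expr)
# pairs evaluated in order, sharing one environment
def evaluate_batch(named_exprs, env):
    env = dict(env)                  # targets may overwrite inputs
    for target, expr in named_exprs:
        env[target] = eval(expr, {}, env)
    return env

out = evaluate_batch([('ab', 'a * b'),
                      ('c', 'a + ab'),
                      ('d', 'ab * 5')],
                     {'a': np.array([1, 2]), 'b': np.array([3, 4])})
print(out['c'], out['d'])            # [ 4 10] [15 40]
```

the real benefit would of course only come from numexpr compiling the whole batch at once, which a Python-level loop like this cannot provide.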
cleanup
=======
- use proper logging & warnings (use logging and warnings modules)
- move expr to its own directory and split it. Move properties there too.
- __entity__ vs .entity
- change constructor arguments to Entity, so that
1) it is shielded from YAML specifics
2) I can use the real thing in test_expr instead of FakeEntity
- factorize/simplify __str__ methods in some way: by using multiple
  inheritance?
- merge src/expr with tools/expr
? do not store the array/data into the entity. The entity would only be a
descriptor.
GUI
===
- make liam2 usable within the IPython notebook
* if it doesn't exist yet, develop a web-based generic "table" viewer for
IPython (using their architecture). This might (or might not) require
developing an IPython extension
* on the server side, implement an HDF5 backend
- assess whether we can get inspiration (or even code) from spyke viewer:
http://neuralensemble.blogspot.be/2012/11/spyke-viewer-020-released.html
- try Kivy (kivy.org)
- integrate with matplotlib for normal graphics
? glumpy for "live" visualisation during the simulation
non feature tasks
=================
? check the situation and possibly document that csv files need to be encoded
in utf8.
- document what each regression function does
- move all these TODO to trac tickets
- add explanation of what you can do with the website both on the site and
in the doc
- reorder the doc to be more logical (like in my presentations)
- add unit tests
projects to keep an eye on
==========================
- Julia (http://julialang.org): a language specifically tailored for
scientific computing. It seems to have a very interesting feature set
(optional typing, fast resulting code, builtin parallelisation primitives,
...) but a somewhat ugly syntax. What I don't know though is whether its
"stdlib" is complete enough for our needs (csv input, hdf output, plotting,
...).
- SciDB: http://www.scidb.org
- pandas: I could kill half of my code and it would bring for free many
of the features I want to develop (and more), but this would need such a
major refactoring that I am reluctant to start it :)
- Numba: early tests seem very promising.
- Parakeet: jit for a subset of python. very similar to numba, but more limited
in scope
- Blaze (http://blaze.pydata.org/) is a generalisation of ndarray (with more
data layouts, potentially distributed), a delayed-evaluation system
(like I have done in liam2) and a built-in code generation framework based
on LLVM (using Numba I think). Documentation is a bit scarce at this point,
but from my quick glance at it, it seems to be exactly what I have been
dreaming of but would have been unable to achieve by myself.
- Odin (equivalent of Blaze from Enthought)
- SymPy. They use mpmath and I wonder how fast it is.
- bottleneck (http://pypi.python.org/pypi/Bottleneck)
nansum, nanmin, nanmax, nanmean, nanstd
it will probably be integrated back into numpy/scipy at some point but that
will probably take a while, so let's use this in the meantime
- pydap (http://www.pydap.org)
access many kinds of data formats using a uniform interface
- corepy
- copperhead (targets NVIDIA GPUs & multi core CPUs)
http://copperhead.github.io/
- Theano (http://deeplearning.net/software/theano/)
it seems to be a PITA to install though (no pre-built packages, needs a
compiler, BLAS, etc...). It is only integrated with CUDA (ie NVidia)
- PyOpenCL: OpenCL seems pretty powerful and more standardised than CUDA
(also works on some CPUs, not only GPUs).
- Clyther: translates python/numpy code to OpenCL
- pythran: python to C++ compiler
- shedskin: python to C++ compiler
- nuitka: python to C++ compiler
- scipy.weave.blitz. It seems a bit limited in functionality but it uses
blitz++ (http://www.oonumerics.org/blitz/) internally which seems to contain
all I need and more, so adding the missing functionality might not be too
hard. It seems to be really slow to compile though, so it will probably
only be worth it when using very large arrays.
- ORC (but I would need to create python bindings for it)
http://code.entropywave.com/projects/orc/
or any of the other C JIT referenced in the external links at:
http://en.wikipedia.org/wiki/Just-in-time_compilation
undecided
=========
? make specifying output field types optional (I can compute them):
listing the fields is technically enough, though having the type makes
things easier and allows some type checks
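computing the output types is plausible with numpy's own promotion rules (a minimal sketch, field names hypothetical): the dtype of an expression can be derived from the declared input dtypes using empty arrays, without touching real data:

```python
import numpy as np

# minimal sketch: derive an output field's dtype from the input
# dtypes via numpy's promotion rules, using zero-length arrays
age = np.empty(0, dtype=np.int32)
weight = np.empty(0, dtype=np.float64)
print((age * weight).dtype)     # float64
```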
? make specifying input fields optional: we could still check
whether needed fields are present or not (by analysing the simulation
code), but we couldn't check that the type of fields is "correct". We would
also need to postpone "compiling" to numexpr (which uses types) until after
the input data file is loaded (or at least known and analysed).
I like the idea, because it brings us closer to Python, but it would be a
lot of work for only a small benefit.
? use stata regressions directly
* would introduce some kind of dependency on Stata which is not a good thing
IMO -> convert them first?
? importer: import all columns, except a few
? allow setting temporary variables in new()
I doubt it is a good idea...
? somehow make "chenard" algorithm (used in align_abs(link=)) available in
standard alignment. I would have to make the iterator run on individuals
directly instead of hh. The code would be much simpler because we wouldn't
need num_unfillable, num_excedent, etc...
? two different values for missing links:
* linked to none/no individual = -1
* missing value = -2
A potential problem (even if the distinction has value -- I am unsure at
this point) is that modellers will probably not have the additional
information in their input data.
? temporaries which are carried from one period to another
ex: "progressive" tsum, or "progessive" duration:
#minr61_c: "duration(minr61_d)"
minr61_c: "if(minr61_d, minr61_c + 1, minr61_c)"
we do not (necessarily) need to store the value for each year (unless we
want to inspect it after the fact)
=> problem for the first period
? m2o link in aggregate link
eg, hh_nch: countlink(household.persons)
I am not sure it's a good idea after all, because that would introduce
two ways to do the same thing and the other way (should) already work(s):
hh_nch: household.get(countlink(persons))
? a: "where(b, c[, d])" -> d implicitly = a
? specific choice method for boolean: boolchoice(0.3)
instead of: choice([True, False], [0.3, 1-0.3])
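a hypothetical boolchoice could be a thin wrapper around a uniform draw (sketch only, numpy's Generator API used for illustration):

```python
import numpy as np

# hypothetical boolchoice(p): True with probability p, equivalent to
# choice([True, False], [p, 1 - p]) but cheaper to write and compute
rng = np.random.default_rng(0)

def boolchoice(p, size):
    return rng.random(size) < p

draws = boolchoice(0.3, 100_000)
print(round(draws.mean(), 2))   # close to 0.3
```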
? somehow "merge" grpxxx and xxxlink functions (at least for the user, not
necessarily technically):
grpcount() -> count()
countlink(persons) -> count(persons)
>>> this is problematic because count() returns a scalar while
>>> count(persons) returns a vector
grpsum(age) -> sum(age)
sumlink(persons, age) -> [sum(person.age for person in household.persons)
for each household]
After some thought, I think having:
grpcount() -> count()
countlink(persons) -> persons.count()
makes more sense.
? at the start of each period, wipe out the array (set all to missing)
=> need to use lag explicitly
? allow processes to be run only at specific periodicity: eg. some monthly,
some quarterly, and some annually
? store alignment tables in hdf input & output files
* it seems like a good idea to store them in the output file, so that we can
know for sure what data was used to generate the output
* however, requiring alignment data to be in the input hdf file might be
less convenient for the modeller, who might have their source data in csv
and would thus need an extra import step.
* on the other hand, having all input data in one file is convenient if they
want to pass their simulation around
? implicit "dead" field. Not sure it is really necessary, as some simulations
might not involve the creation or death of individuals. And some people
might want to name it differently: alive, ...
? implicit age: I don't like the idea but Geert would like to have it.
? implement data generator using variable definitions + custom code where
it is absolutely needed
? explicit "fill missing values" so that you can do operations between vectors
of a past period.
? OR, we could "timestamp" each expression and convert only when using both
old and current ones in the same expression
? automatically add dependencies to simulation? (eg add person.difficult_match
when person.partner_id is computed)
? allow lag to use temporary variables, though it doesn't make much sense.
it's better to only allow macros
? move expression optimisation/simplification to numexpr
? "new" action is limited because it can only have one parent, it might be
interesting to have an arbitrary number of them. eg:
newbirth: "new('person', filters={'mother': to_give_birth & ~male,
'father': to_give_birth & male},
mother_id=mother.id,
father_id=father.id,
male=logit(0.0),
partner_id=-1)"
however, this might be useless/impractical, as eg, the father filter above
is wrong, and if an individual is created from several parents, they will
usually (always?) need to be matched beforehand, so one or several fields
or links will be available to get to the other parents. Consequently, this
syntax might be more appropriate:
newbirth: "new('person', filter={'mother': to_give_birth},
links={'father': mother.partner},
mother_id=mother.id,
father_id=father.id,
male=logit(0.0),
partner_id=-1)"
... but in that case, the current syntax works as well.
? Q: use validictory instead of my custom validator?
http://readthedocs.org/docs/validictory/
A: probably not. I prefer my own solution, as the syntax is more concise and
(dare I say) more pythonic. It could be extended if needed:
- Or(a, b, c) to allow either enumeration of values or alternate syntaxes:
Or(str, {'type': str, 'initialdata': bool})
Or('many2one', 'one2many')
- List(x, y, z, option1=value1) if I need extra options for lists
- Dict({}, option1=value1) if I need extra options for dicts