Wouldn't it help normalize the data if you removed the Xi with excessively high/low (maybe just the low) values of Wi?
As for changing the input data, I would say update it as often as possible.
I think it helps to think about two different testing questions:
1. Does the algorithm implement the math correctly?
2. Will the math perform correctly against the real world?
The first question is a traditional software QA question, which can be answered with "toy" test data, using fewer variables. It has a "yes or no" answer.
The second question is more of a statistical performance evaluation, where you're asking how well the algorithm performs against some relatively realistic data set. You could test against live data every time, but then when you get a borderline result and want to tweak the algorithm and re-run the test, it's hard to tell whether any change is due to your tweaking or to changes in the live data.
On the other hand, you don't want to test against obsolete data captured from the live system several years ago, either. I suspect what might work best is to maintain several corpora of data to test against, and stagger the replacement schedule so that each new corpus can be validated against test results from several older ones. If the new corpus checks out, you retire the oldest.
I agree with "Jeremy Leader" - it helps to first define what it is that is being tested. From my understanding of the post, it appears the main testing question being asked is, "does this algorithm learn correctly?" , with the sticking points being that, to test if something actually learns, you need to 1) define the parameters that indicate the system has demonstrated it has learned, and 2) define them in such a way that obtaining said parameters fits the time/resource constraints of the testing/development/release processes.
I think point 2) is the kicker - tests need to run as quickly as possible. So, first of all, don't run the full gamut of learning-related tests unless the algorithm itself has changed. Sounds like a no-brainer, but the post mentions wanting to "release new features fast", and the first thing I thought was, "yeah, but are those features related to the algorithm?" With a bit of planning, changes to the algorithm itself could be tested in isolation from other features, and therefore be tested by the time they are integrated.
The questions around what size corpus to use, whether to cull importance factors, and how often to update the corpus all depend on how you define 1). Again, this may be obvious, but the reason to have a huge corpus is that it allows you to identify tiny but significant changes in the behavior of the algorithm. Take a few snapshots of significant changes to the algorithm and run them over the entire corpus. Are the variations meaningful? You might be able to [automatically] prune the corpus by analyzing these results.
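As a rough sketch of what I have in mind (made-up names, assuming each snapshot of the algorithm can be reduced to some score per corpus item):

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotComparison {

    // Stand-in for one snapshot of the algorithm.
    interface Scorer {
        double score(String item);
    }

    // Run two snapshots over the whole corpus and collect the items
    // where they disagree by more than some threshold.
    static List<String> significantDifferences(List<String> corpus,
            Scorer oldVersion, Scorer newVersion, double threshold) {
        List<String> changed = new ArrayList<String>();
        for (String item : corpus) {
            double delta = Math.abs(newVersion.score(item) - oldVersion.score(item));
            if (delta > threshold) {
                changed.add(item);
            }
        }
        return changed;
    }
}
```

Items that never show up in that list across several snapshots would be candidates for pruning from the corpus.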
I see value in having a baseline data set for trending purposes. Live data evolves, but if your corpus is already very large I wouldn't worry too much about it not capturing some magical new aspect out there. Perhaps there is some way to identify and obtain "new" live data that is quantifiably different than the data that already exists in the corpus, and valuable for that reason?
Overall, I tend to think that running a full suite of tests using all importance factors and the entire corpus is necessary for critical algorithms. Maybe the development and release processes could be tweaked so that, as far as possible, these tests are not the bottleneck.
There's an interesting related question about writing unit tests for non-deterministic code in general.
If you have a stochastic algorithm that produces a different result each time -- because it's sampling from a distribution, for example -- how do you write tests that strongly test the code when you don't know what output to expect?
In this example, you can only see if the sampling code is correct by running it lots of times and seeing whether the results fit the expected distribution, which goes against the principle of making unit tests small and fast.
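The best I've managed so far looks something like this - a sketch in Java (hypothetical, and the tolerances are picked by eye), which is really a statistical check rather than a unit test:

```java
import java.util.Random;

// Check that the sampling code produces something close to a standard
// normal distribution (mean 0, variance 1) over a large number of draws.
public class SamplerDistributionCheck {
    public static void main(String[] args) {
        Random random = new Random();
        int n = 1000000;
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (int i = 0; i < n; i++) {
            double x = random.nextGaussian(); // the sampling code under test
            sum += x;
            sumOfSquares += x * x;
        }
        double mean = sum / n;
        double variance = sumOfSquares / n - mean * mean;

        // Loose tolerances: this catches gross errors, not subtle bias.
        if (Math.abs(mean) > 0.01 || Math.abs(variance - 1.0) > 0.01) {
            throw new AssertionError("sample moments are off: mean=" + mean
                    + ", variance=" + variance);
        }
        System.out.println("mean=" + mean + ", variance=" + variance);
    }
}
```

And even with a million draws, that only rules out gross mistakes in the sampler.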
I work in bioinformatics, and I'm doing my best to champion test-driven development, but people do come up with examples like these where the normal model doesn't fit so well.
I'd be interested to hear people's thoughts on this.
Andrew.
@Andrew
I work on agent-based modeling software that uses stochastic algorithms, so I'm familiar with the situation you're in. For unit testing, the trick is to encapsulate the sampling from the distribution functions inside your own functions/classes and inject these into the classes that consume them. You can then test code that consumes non-deterministic functionality by injecting a mock/stub object that returns predefined values, and check the results because you know a priori what the distribution function will return. I'd suggest you check out Misko Hevery's posts on this blog (as well as his own blog) for some great articles on Dependency Injection.
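A minimal sketch of what that looks like in Java (made-up names, not code from our project):

```java
import java.util.Random;

// The only place that touches the RNG directly.
interface DistributionSampler {
    double nextSample();
}

// Production implementation: samples from a Gaussian.
class GaussianSampler implements DistributionSampler {
    private final Random random = new Random();
    public double nextSample() {
        return random.nextGaussian();
    }
}

// Code under test: consumes samples but knows nothing about randomness.
class NoisyAccumulator {
    private final DistributionSampler sampler;
    NoisyAccumulator(DistributionSampler sampler) {
        this.sampler = sampler;
    }
    double accumulate(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v + sampler.nextSample(); // add noise to each value
        }
        return total;
    }
}

// Unit test: inject a stub that returns known values, so the expected
// result is known a priori.
class NoisyAccumulatorTest {
    public static void main(String[] args) {
        DistributionSampler stub = new DistributionSampler() {
            public double nextSample() {
                return 0.5; // predefined "random" value
            }
        };
        NoisyAccumulator accumulator = new NoisyAccumulator(stub);
        double result = accumulator.accumulate(new double[] {1.0, 2.0});
        if (Math.abs(result - 4.0) > 1e-9) {
            throw new AssertionError("expected 4.0 but got " + result);
        }
        System.out.println("stubbed sampler gives the expected result");
    }
}
```

The production GaussianSampler is then the only piece left that needs the slow, statistical kind of check you described.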
The approach you listed - doing lots of iterations and checking that the results fit the expected distribution - would be a functional test rather than a unit test. I suggest you use that approach as well, but it's probably something that should be handled by QA. Though I'm gonna take a stab in the dark and guess that you're working in a research department like me and don't have QA either.
-Mark
Thanks for the suggestions, Mark. I'm a big fan of DI, having started playing with Guice recently, but it's very easy to slip out of the OO paradigm and into procedural (= less modular) thinking when you're doing algorithmic stuff. Old habits, etc.
You're right about the QA department - the closest thing we have is the peer reviewers on the resulting journal articles, and (I suspect) they rarely even try the software the article is about!
Andrew.
I work on a library of Computational Intelligence algorithms and testing is one of the things that has been plaguing me for a long time.
There are effectively two different kinds of tests that are needed (similar to what Jeremy pointed out earlier in the comment list).
I eventually decided on making the algorithm completely deterministic by specifying the seed for the RNGs in order to test the validity of the algorithm as a whole. Many tests, however, exist to test the smaller components in isolation.
We use setter-based injection, with a large amount of success so far. This makes sure that the objects are complete, and we mock out whatever we need to in order to ensure that the behavior is as expected.
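A minimal sketch of the shape it takes (made-up names, not our actual classes) - the setter lets a whole-algorithm test pin the seed, while component tests can inject a stubbed Random instead:

```java
import java.util.Random;

// The algorithm receives its randomness through a setter, so tests can
// substitute either a fixed-seed Random or a stub.
public class StochasticSearch {
    private Random random = new Random(); // default for production use

    public void setRandom(Random random) {
        this.random = random;
    }

    // Toy "search": return the best of n random candidates in [0, 1).
    public double run(int candidates) {
        double best = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < candidates; i++) {
            best = Math.max(best, random.nextDouble());
        }
        return best;
    }

    public static void main(String[] args) {
        // Whole-algorithm test: fixing the seed makes the run reproducible,
        // so the result can be compared against a recorded expected value.
        StochasticSearch first = new StochasticSearch();
        first.setRandom(new Random(42));

        StochasticSearch second = new StochasticSearch();
        second.setRandom(new Random(42));

        if (first.run(100) != second.run(100)) {
            throw new AssertionError("same seed should give the same result");
        }
        System.out.println("run is deterministic for a fixed seed");
    }
}
```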