zirias (on snac): <a href="https://snac.bsd.cafe?t=programming" class="mention hashtag" rel="nofollow noopener" target="_blank">#programming</a> question, TL;DR: How to test for an (approximately) uniform <a href="https://snac.bsd.cafe?t=distribution" class="mention hashtag" rel="nofollow noopener" target="_blank">#distribution</a>?<br><br>Today at work, I created a piece of code that should <a href="https://snac.bsd.cafe?t=partition" class="mention hashtag" rel="nofollow noopener" target="_blank">#partition</a> a stream of data entities based on some string keys of unknown format. The only requirements were that the same key must always be assigned to the same partition and that the distribution should be approximately uniform (in other words, all partitions should have roughly the same size). My approach was to apply a non-cryptographic <a href="https://snac.bsd.cafe?t=hash" class="mention hashtag" rel="nofollow noopener" target="_blank">#hash</a> function to the keys (defaulting to <a href="https://snac.bsd.cafe?t=xxhash3" class="mention hashtag" rel="nofollow noopener" target="_blank">#xxhash3</a>), XOR-fold the hash down to 32 bits and then take the result as an unsigned integer modulo the desired number of partitions.<br><br>I normally only write code for my private projects (as a software architect, I rarely have the time to touch any code at work, unfortunately), and there, I'd certainly test something like this on some large samples of input data, but probably just once, manually. 🙈<br><br>But for work, I felt this should be done by a <a href="https://snac.bsd.cafe?t=unittest" class="mention hashtag" rel="nofollow noopener" target="_blank">#unittest</a>. I also think at least one set of input data should be somehow "random" (while others should contain "patterns"). My issue is with unit-testing the case for "random" input.
One test I wrote feeds 16k GUIDs (in string representation) to my partitioner configured for 13 partitions, and checks that the ratio between the sizes of the largest and smallest partitions stays below 2, so a <b>very</b> relaxed check. Still, doubt remains, because there's no way to guarantee this test won't go "red" <i>eventually</i>.<br><br>I now see several possible options:<br><ul><li>just ignore this because hell freezing over is more likely than that test going red ...</li><li>don't even attempt to test the resulting distribution on "random" input</li><li>bite the bullet and write some extra code creating "random" strings (Unicode, random length within some limits) from a PRNG which will produce a predictable sequence</li></ul><br>What do you think? 🤔 The last option kind of sounds best, but then the complexity of the test will probably exceed the complexity of the code tested. 🙈<br>
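To make the last option concrete, here is a minimal Python sketch of both the partitioning scheme described above and a deterministic test for it. Everything in it is an assumption for illustration, not the code from work: the function names are hypothetical, the seed is arbitrary, and `hashlib.blake2b` truncated to 64 bits only stands in for xxhash3 (which is not in the Python standard library, and unlike blake2b is non-cryptographic). The chi-square statistic is an extra measure not mentioned in the post, and the bound used with it is deliberately loose.

```python
import hashlib
import random
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Map a string key to a partition index in [0, partitions)."""
    # Stand-in 64-bit hash (assumption): the post uses xxhash3 here.
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    h64 = int.from_bytes(digest, "little")
    # XOR-fold the 64-bit hash down to 32 bits.
    folded = (h64 >> 32) ^ (h64 & 0xFFFFFFFF)
    return folded % partitions


def random_keys(count: int, seed: int, max_len: int = 40) -> list[str]:
    """Produce a reproducible list of 'random' Unicode strings."""
    rng = random.Random(seed)  # fixed seed, so the sequence is predictable
    keys = []
    for _ in range(count):
        length = rng.randint(1, max_len)
        # Stay below U+D800 so no surrogate code points are generated
        # (those cannot be encoded as UTF-8).
        keys.append("".join(chr(rng.randint(0x20, 0xD7FF)) for _ in range(length)))
    return keys


def partition_sizes(keys: list[str], partitions: int) -> list[int]:
    """Count how many keys land in each partition."""
    counts = Counter(partition_for(key, partitions) for key in keys)
    return [counts.get(p, 0) for p in range(partitions)]


def chi_square(sizes: list[int]) -> float:
    """Pearson chi-square statistic against a perfectly uniform split."""
    expected = sum(sizes) / len(sizes)
    return sum((n - expected) ** 2 / expected for n in sizes)


# The relaxed check from the post, on reproducible input:
sizes = partition_sizes(random_keys(16 * 1024, seed=12345), 13)
assert max(sizes) < 2 * min(sizes)
# Additional, still very loose, bound on the chi-square statistic
# (12 degrees of freedom for 13 partitions).
assert chi_square(sizes) < 100
```

Because the generator is seeded, the input is identical on every run, so the test either always passes or always fails for a given hash function. The key generator is only a few lines, which may ease the worry about the test growing more complex than the code under test.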