August 14, 2013

Efficient character filtering in C

Filed under: Programming — Tags: , — admin @ 3:17 pm

In this article, I would like to talk about how to filter a set of characters from a string in C language efficiently.

Let’s start by defining the function prototype.

1
2
3
4
5
6
/*
 * ptr    : input string to be filtered.
 * inp    : set of characters to define filter.
 * return value : void.
*/
void omit_chars(char *ptr, const char *inp);

First approach may be like this,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
void omit_chars(char *ptr, const char *inp) 
{
    long shift = 0;
    char lookup[256]; /* we have at most 256 chars for 8 bits */
 
    memset(lookup, 0, sizeof(lookup));
 
    while (*inp)
        lookup[*inp++] = 1;	/* mark this character to filter */
 
    do {
        if (lookup[*ptr]) 
            shift++;	/* skip marked characters by shifting copy offset */
        else if (shift) 
            *(ptr-shift) = *ptr;
    } while (*ptr++);
}

seems fast enough to filter efficiently. But, is it really efficient? Especially when we consider look-up table memory footprint. In essence we do not need to mark all bytes in a look-up table. We have 256 different characters, and all we need to know is whether this(these) filter character(s) is/are in the black list. That’s all.

Let’s re-factor our code as below,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
void omit_chars(char *ptr, const char *inp) 
{
    long shift = 0;
    uint32_t lookup[8]; /* 256 bits lookup table */
 
    memset(lookup, 0, sizeof(lookup));
 
    while (*inp) {
        lookup[(*inp)/0x20] |= 1 << ((*inp) & 0x001F); /* mark this character to filter */
        inp++;
    }
 
    do {
        if (lookup[(*ptr)/0x20] & (1<<((*ptr) & 0x001F)))
	    shift++;	/* skip marked characters by shifting copy offset */
        else if (shift) 
            *(ptr-shift) = *ptr;
    }  while (*ptr++);
}

Now, we can say it seems better, but on the other we added extra complexity into the code to set/read specific bit positions for associated characters to be filtered.

So this is a sort of trade-off one way to another, either we may chose lower memory footprint or faster look-up table access.

Any kinds of feedback are welcome!

Powered by WordPress