IPv6 regex – Vernon Mauery

I spent too much time today playing with IPv6 stuff that I didn’t have any time to work on my latest time sink, Pyrobox. I will have to write about that some other time.

For now, I wanted to get this out there. I was curious about how easy it was to confirm that a string is a valid IPv6 address. It turns out that it is not so simple, thanks to the “space saving” techniques of zero folding that is used. Here are some examples of IPv6 addresses that are valid:

::                           unspecified address
::1                          localhost
fe80::219:7eff:fe46:6c42     link local address
::00:192.168.10.184          embedded IPv4 address

Yes, those are just some of the variety that was introduced that makes the protocol easier to use from a high level, but harder to implement and use from a low level. I mean, I tell my router to advertise IPv6 addresses and within seconds all the machines on the LAN have configured themselves with globally routable IPv6 addresses. Impressive. Yet so very *not* human friendly. There is no way around it, when you are dealing with 128 bits, it is a lot of data to put in a human readable format. That’s what the 8 sets of quad-byte segments are all about, human readability, but it is still so long, nobody will be able to spout off their IP address like they can with IPv4. Thankfully nameservers should help out with that. But sometimes we will have to deal with IPv6 addresses and I want to know how easy it is to recognize a true address from a fake.

I wrote a little ditty in python to help me conceptualize it. I found several sites that had similar regexes to this Ruby parser. I was somewhat dismayed to find out that while they successfully classify a bunch of the namespace correctly, they also have an infinite set of bad addresses they classify as good. Oops. Time to look at your regex handbook again, folks. To remedy their situation is not so easy though. Here is the python program I wrote that generates a regex:

 import re

def valid_ipv6(addr, debug=False):
	# we will build an array of matches and then join them
	a = []
	# the simplest match in the ipv6 address
	sm = r'[0-9a-f]{1,4}'
	for i in range(1,7):
		a.append(r'A(%s:){1,%d}(:%s){1,%d}Z' % (sm, i, sm, 7-i))
	a.append(r'A((%s:){1,7}|:):Z' % sm)
	a.append(r'A:(:%s){1,7}Z' % sm)
	a.append(r'A((([0-9a-f]{1,4}:){6})(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3})Z')
	# support for embedded ipv4 addresses in the lower 32 bits
	ipv4 = r'(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3}'
	a.append(r'A((%s:){5}%s:%s)Z' % (sm, sm, ipv4))
	a.append(r'A(%s:){5}:%s:%sZ' % (sm, sm, ipv4))
	for i in range(1,5):
		a.append(r'A(%s:){1,%d}(:%s){1,%d}:%sZ' % (sm, i, sm, 5-i, ipv4))
	a.append(r'A((%s:){1,5}|:):%sZ' % (sm, ipv4))
	a.append(r'A:(:%s){1,5}:%sZ' % (sm, ipv4))
	bigre = "("+")|(".join(a)+")"
	if debug:
		for i in range(len(a)):
			r = re.compile(a[i], re.I)
			if r.search(addr):
				print a[i]
		print "n%s" % bigre
	bigre = re.compile(bigre, re.I)
	return bigre.search(addr) and True

if __name__ == '__main__':
	import sys
	if len(sys.argv) < 2:
		print "usage: %s "%sys.argv[0]
		sys.exit(1)
	
	if valid_ipv6(sys.argv[1], True):
		print "valid"
		sys.exit(0)
	else:
		print "invalid"
		sys.exit(1)

When it runs, it can print out the final regex (which embodies the entire IPv6 address language as far as I can tell). Here is that regex:

(A([0-9a-f]{1,4}:){1,1}(:[0-9a-f]{1,4}){1,6}Z)|
(A([0-9a-f]{1,4}:){1,2}(:[0-9a-f]{1,4}){1,5}Z)|
(A([0-9a-f]{1,4}:){1,3}(:[0-9a-f]{1,4}){1,4}Z)|
(A([0-9a-f]{1,4}:){1,4}(:[0-9a-f]{1,4}){1,3}Z)|
(A([0-9a-f]{1,4}:){1,5}(:[0-9a-f]{1,4}){1,2}Z)|
(A([0-9a-f]{1,4}:){1,6}(:[0-9a-f]{1,4}){1,1}Z)|
(A(([0-9a-f]{1,4}:){1,7}|:):Z)|
(A:(:[0-9a-f]{1,4}){1,7}Z)|
(A((([0-9a-f]{1,4}:){6})(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3})Z)|
(A(([0-9a-f]{1,4}:){5}[0-9a-f]{1,4}:(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3})Z)|
(A([0-9a-f]{1,4}:){5}:[0-9a-f]{1,4}:(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3}Z)|
(A([0-9a-f]{1,4}:){1,1}(:[0-9a-f]{1,4}){1,4}:(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3}Z)|
(A([0-9a-f]{1,4}:){1,2}(:[0-9a-f]{1,4}){1,3}:(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3}Z)|
(A([0-9a-f]{1,4}:){1,3}(:[0-9a-f]{1,4}){1,2}:(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3}Z)|
(A([0-9a-f]{1,4}:){1,4}(:[0-9a-f]{1,4}){1,1}:(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3}Z)|
(A(([0-9a-f]{1,4}:){1,5}|:):(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3}Z)|
(A:(:[0-9a-f]{1,4}){1,5}:(25[0-5]|2[0-4]d|[0-1]?d?d)(.(25[0-5]|2[0-4]d|[0-1]?d?d)){3}Z)

Yes, I realize that is one big regex (note that the lines should all really be concatenated, but I broke it apart at the | points to make it easier to read). But I think it is more complete and correct than other regexes that google helped me to find. If anyone knows a better way to do this using some regex fu, please let me know.