This is a presentation I gave years ago, but I have heard that people still use it as a reference, so here it is, slightly cleaned up. If you are writing systems that process email addresses in some form or another, you might want to read this.
1. What's in an Email Address?
RFC2822 Em@il @ddresses for Mere Mortals
Schalk W. Cronjé
@ysb33r
2. Why This Topic?
● Recurring bugs in software we build
● Lack of understanding at all levels
– Developers
– Testers
– Support People
● Assumptions made, without reading RFCs
● Understanding RFCs is not straightforward
– RTFM is difficult when TFM cannot be found
● We require a basic reference
3. Content
● Overview
● Local-part
● Domain-part
● Valid or not?
● The real world
4. Brave, brave RFC World
● RFC1034 / RFC1035
– Domain name specification.
● RFC2821 (previously RFC821)
– Restrictions on email addresses at protocol levels.
● RFC2822 (previously RFC822)
– Specifies layout of email transmitted over internet.
– Specifies format of email address.
● RFC2047
– Encoding of 8-bit in RFC2822 header fields
● RFC3490
– Encoding international domain names
● RFC1123 (partially updated by RFC2821)
– Requirements for internet hosts
5. Address Format
Modern format
local-part @ domain-part
Historic format (RFC821/RFC2821)
source-route : local-part @ domain-part
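As a minimal sketch of the modern format, an address can be split at the last @, because a domain-part never contains one while a quoted local-part may. The function name `split_address` is an assumption made for illustration; source-routes and any further validation are deliberately ignored here.

```python
from typing import Tuple


def split_address(address: str) -> Tuple[str, str]:
    """Split an address into (local-part, domain-part) at the last '@'."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not in local-part@domain-part form: {address!r}")
    return local, domain


print(split_address("schalk@ntaba.biz"))
print(split_address('"schalk_cronje@lon_eng"@ntaba.biz'))
```

Splitting at the first @ instead would break quoted local-parts that contain an @ of their own.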
6. RFC2822 Local Parts
● Unrestricted characters
0..9 a..z A..Z ! # $ % & ' * + - / = ? ^ _ ` | { } ~ .
● Quotable characters (quoted by " )
< [ ( : @ ; ) ] > , non-ws-ctrl
● Illegal characters
All 8-bit.
● Whitespace
ws-ctrl illegal, only used for folding in headers
space character is valid if quoted
[ RFC2821: 4.1.2; RFC2822: 3.2, 3.4 ]
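The unrestricted set above can be turned into a small check for unquoted local-parts. This is a sketch only: the name `is_unquoted_local_part` is hypothetical, and the handling of the dot (no leading, trailing or doubled dots) follows the RFC2822 dot-atom rule rather than anything stated on this slide.

```python
import string

# Characters listed as unrestricted; the dot is handled separately below.
UNRESTRICTED = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`|{}~")


def is_unquoted_local_part(local: str) -> bool:
    """True if `local` needs no quoting: dot-separated runs of unrestricted characters."""
    if not local:
        return False
    # Splitting on '.' yields an empty piece for a leading, trailing or doubled dot.
    return all(piece and set(piece) <= UNRESTRICTED for piece in local.split("."))


for candidate in ["{$cha?k*cr%nje}", "schalk cronje", "schalk..cronje"]:
    print(candidate, is_unquoted_local_part(candidate))
```

Anything rejected here may still be legal in quoted form, so a negative result means "needs quoting or is illegal", not "invalid".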
7. Local Payload
● Routing characters
– ! % have been used for local-routing in legacy
systems, including UUCP and MHS.
– Can be used to bypass routing in mis-configured
systems.
● Shell exploits
– | / ` $ have been used to attempt remote
command execution
8. Does Case Matter?
● Case is ignored in domain
ntaba.biz == NTABA.BIZ
● Strictly-speaking case matters in local-parts
schalk@ntaba.biz != ScHaLk@ntaba.biz
– Most MTAs ignore case
– RFC2821 discourages use of case as a
distinguishing factor
[ RFC2821: 2.4 ]
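A comparison that respects these rules might look like the sketch below. The function name and the `ignore_local_case` switch are assumptions; the switch exists because most MTAs ignore case in local-parts even though, strictly speaking, it matters.

```python
def same_mailbox(a: str, b: str, ignore_local_case: bool = False) -> bool:
    """Compare two addresses: domains case-insensitively, local-parts per policy."""
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    if ignore_local_case:
        local_a, local_b = local_a.lower(), local_b.lower()
    return local_a == local_b and domain_a.lower() == domain_b.lower()


print(same_mailbox("schalk@ntaba.biz", "schalk@NTABA.BIZ"))
print(same_mailbox("schalk@ntaba.biz", "ScHaLk@ntaba.biz"))
print(same_mailbox("schalk@ntaba.biz", "ScHaLk@ntaba.biz", ignore_local_case=True))
```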
9. Does Size Matter?
● RFC2821 places limitations on length of local-part and
domain-part
– 64 characters for local-part
– 255 characters for domain-part
● This is normally not a problem for messages
transmitted across the internet, but can be problematic
for in-house applications or encoded email addresses
such as X.400.
● Many MTAs will now ignore this length restriction as
long as the overall SMTP protocol line length restriction
is not exceeded.
[ RFC2821: 4.5.3.1 ]
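Because many MTAs relax these limits, a length check is best written with overridable defaults. The sketch below assumes a hypothetical helper `length_problems` that reports problems instead of rejecting outright.

```python
from typing import List

MAX_LOCAL = 64     # RFC2821 limit for the local-part
MAX_DOMAIN = 255   # RFC2821 limit for the domain-part


def length_problems(address: str,
                    max_local: int = MAX_LOCAL,
                    max_domain: int = MAX_DOMAIN) -> List[str]:
    """Return a list of length violations; an empty list means none were found."""
    local, _, domain = address.rpartition("@")
    problems = []
    if len(local) > max_local:
        problems.append(f"local-part is {len(local)} characters (limit {max_local})")
    if len(domain) > max_domain:
        problems.append(f"domain-part is {len(domain)} characters (limit {max_domain})")
    return problems


print(length_problems("schalk@ntaba.biz"))
print(length_problems("x" * 65 + "@ntaba.biz"))
```

An in-house deployment that needs longer local-parts can pass a larger `max_local` rather than patching the code.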
10. Domain Parts
● Can be either an RFC1035 domain or an address literal
● Valid characters for domain names:
a..z A..Z 0..9 -
● Subdomains separated by dot character.
● Subdomain may not start or end with dash.
● 255 characters max length.
● 63 characters max per subdomain.
● Cannot start or end in dot.
● The restriction on subdomains starting with a digit has
been relaxed.
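The rules above translate fairly directly into code. This is a sketch under the assumption that only the syntactic rules listed on this slide are checked; the name `is_valid_domain` is hypothetical and no DNS lookup is performed.

```python
import re

# One subdomain: letters, digits and dashes, not starting or ending with a dash.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_domain(domain: str) -> bool:
    """Check a domain name against the length, character and dot rules."""
    if not domain or len(domain) > 255:
        return False
    # A leading, trailing or doubled dot produces an empty label, which fails the match.
    return all(len(label) <= 63 and LABEL.fullmatch(label) is not None
               for label in domain.split("."))


for candidate in ["ntaba.biz", "1967.com", ".ntaba.biz", "lon_eng.ntaba.biz"]:
    print(candidate, is_valid_domain(candidate))
```

Note that a dotted quad such as 192.168.1.1 passes these purely syntactic rules, so it needs a separate check.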
11. Address Literals
● Workarounds for when host names cannot
be resolved.
– @[protocol:host-address]
– IPv4: @[192.1.1.1]
– IPv6: @[IPv6:fe80::a00:20ff:fec2:2ef4]
● Protocol must be registered with IANA.
[ RFC2821: 4.1.3 ]
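Address literals can be recognised with Python's standard `ipaddress` module. The sketch below assumes only the IPv4 and IPv6 forms shown above matter; the helper name `parse_address_literal` is hypothetical, and other registered protocol tags are simply treated as unrecognised.

```python
import ipaddress


def parse_address_literal(domain_part: str):
    """Return an ipaddress object for a '[...]' literal, or None if it is not one."""
    if not (domain_part.startswith("[") and domain_part.endswith("]")):
        return None
    body = domain_part[1:-1]
    try:
        if body.lower().startswith("ipv6:"):
            return ipaddress.IPv6Address(body[5:])
        return ipaddress.IPv4Address(body)
    except ValueError:
        # Raised by ipaddress for anything that is not a well-formed address.
        return None


print(parse_address_literal("[192.1.1.1]"))
print(parse_address_literal("[IPv6:fe80::a00:20ff:fec2:2ef4]"))
print(parse_address_literal("192.1.1.1"))
```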
12. International Domain Names
● Domain names not representable in US-ASCII
can be registered
● Such domain names cannot be handled by
DNS or existing protocols
● RFC 3490 describes the encoding/decoding
of such domain names from presentation to
protocol:
exämple.com => xn--exmple-cua.com
● Potential for phishing
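Python's built-in `idna` codec implements the RFC 3490 conversion, so the presentation-to-protocol step can be sketched as follows. The two helper names are assumptions made for illustration.

```python
def to_protocol_form(domain: str) -> str:
    """Convert a presentation-level domain to its ASCII (xn--) protocol form."""
    return domain.encode("idna").decode("ascii")


def to_presentation_form(domain: str) -> str:
    """Convert an ASCII protocol-level domain back to its readable form."""
    return domain.encode("ascii").decode("idna")


print(to_protocol_form("ex\u00e4mple.com"))        # xn--exmple-cua.com
print(to_presentation_form("xn--exmple-cua.com"))
```

Only the domain-part should be passed through this conversion; the local-part has no equivalent agreed encoding.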
13. Valid or not?
schalk_cronje@ntaba.biz
● Valid even under strict RFC2822
interpretation
● Most punctuation characters are valid in local part,
including:
{$cha?k*cr%nje}@ntaba.biz
14. Valid or not?
schalk_cronje@[192.168.1.1]
● Yes, the domain part is an address-literal
● Acceptance of address-literals should be
configurable
– They can be security risks
– RFC2821 prefers usage of MX-based deliveries.
15. Valid or not?
schalk_cronje@192.168.1.1
● No, it is neither an address-literal nor a valid
domain name.
● Some systems will attempt to deliver this by
passing the 192.168.1.1 to the domain
resolving subsystem, which in turn will
simply return the IP address.
– This violates RFC1123
– This is a potential security risk.
[ RFC1123: 2.1 ]
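To avoid handing a bare IP address to the resolver, the domain-part can be tested before any lookup. A minimal sketch using the standard `ipaddress` module; the function name `is_bare_ip` is hypothetical.

```python
import ipaddress


def is_bare_ip(domain_part: str) -> bool:
    """True if the domain-part is an IP address written without square brackets."""
    try:
        ipaddress.ip_address(domain_part)
    except ValueError:
        return False
    return True


print(is_bare_ip("192.168.1.1"))    # bare IP: reject or flag before resolving
print(is_bare_ip("[192.168.1.1]"))  # bracketed literal: handled elsewhere
print(is_bare_ip("1967.com"))
```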
16. Valid or not?
schalk_cronje@1967.com
● Not valid according to RFC1035
● Limitation lifted in RFC1123.
[ RFC1123: 2.1 ]
17. Valid or not?
schalk_cronje@#192168
● Valid in RFC821 for compatibility with
non-TCP/IP networks.
● Outlawed by RFC2821.
● Not supported by any modern MTA.
[ RFC821: 4.1.2; RFC2821: F.4 ]
18. Valid or not?
schalk_cronje@.ntaba.biz
● No, domain-part may not start with a dot.
[ RFC2822: 3.2.4 ]
19. Valid or not?
schalk_cronje@ntaba.biz.
● No, strictly RFC2822 states that domain-part
may not end with a dot.
● RFC1034 uses the dot-ending to indicate
absolute domains (FQDN) in resource
records.
● Most systems will accept, resolve and deliver
this
[ RFC2822: 3.2.4; RFC1034: 3.1]
20. Valid or not?
schalk_cronje@ntaba..biz.
● No, consecutive dots are not allowed in
domain parts.
[ RFC2822: 3.2.4; RFC1034: 3.1]
21. Valid or not?
● No.
.schalk_cronje@ntaba.biz
schalk..cronje@ntaba.biz
– Local-parts may not start with a dot.
– Consecutive dots are not allowed in local parts.
● Pragmatically, many known MTAs don’t care
[ RFC2822: 3.2.4]
22. Valid or not?
schalk_cronje@lon_eng.ntaba.biz
● No, _ is not valid in domain names
● Some DNS servers will support this.
● Some sites do use the _ for internal systems.
● It remains illegal for internet operations
[ RFC2821: 4.1.3 ]
23. Valid or not?
schalk_cronje@lon_eng@ntaba.biz
● No, @ cannot be used unquoted in local
parts
"schalk_cronje@lon_eng"@ntaba.biz
schalk_cronje\@lon_eng@ntaba.biz
[ RFC2822: 3.2.5, 3.4 ]
24. Local-part Quoting
● Quoting should only be used where
absolutely necessary
● Where a quoted-form has an unquoted
form...
– The two forms are equivalent
– The unquoted form should be used for
transmission
● Quoting is performed by enclosing local-part
in quotes or preceding a character by
backslash.
[ RFC2821: 4.1.2 ]
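The preference for the unquoted form can be sketched as a small normalisation step. The name `prefer_unquoted` is an assumption, and the pattern used for "can be written unquoted" is the RFC2822 dot-atom form; only local-parts enclosed in double quotes are considered.

```python
import re

# Unquoted local-parts: dot-separated runs of letters, digits and these punctuation marks.
_ATOM = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
_DOT_ATOM = re.compile(_ATOM + r"(?:\." + _ATOM + r")*")


def prefer_unquoted(local: str) -> str:
    """Return the unquoted form of a quoted local-part when an equivalent one exists."""
    if len(local) >= 2 and local.startswith('"') and local.endswith('"'):
        # Drop the enclosing quotes and resolve backslash escapes.
        content = re.sub(r"\\(.)", r"\1", local[1:-1])
        if _DOT_ATOM.fullmatch(content):
            return content
    return local


print(prefer_unquoted('"schalk_cronje"'))
print(prefer_unquoted('"schalk cronje"'))
```

A local-part that genuinely needs its quotes, such as one containing a space or an @, is returned unchanged.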
25. Valid or not?
<schalk_cronje@ntaba.biz>
● No, this is an envelope for email addresses
● The following is valid:
"<schalk_cronje>"@ntaba.biz
26. Valid or not?
schalk_O"cronje@ntaba.biz
● No, the double quote is a quoting character.
27. Valid or not?
schalk_O'cronje@ntaba.biz
● Yes, apostrophe is valid in unquoted form
28. Valid or not?
"schalk_O\"cronje"@ntaba.biz
● This is debatable
● Neither RFC2821, nor RFC2822, is
completely clear whether the double quote is
valid if escaped
Note that the backslash, "\", is a quote character, which is
used to indicate that the next character is to be used literally
[ RFC2821: 4.1.2 ]
29. Valid or not?
schalk_cronjé@ntaba.biz
● Not at RFC2821/RFC2822 levels - contains
at least one 8-bit character
● Can be completely valid at the presentation
level
– Email client can take care of translation between
a user-readable form and a level suitable for
transmission
● There is NO agreed standard for encoding
non-US-ASCII in local parts
30. My 8-bit's Worth
● Custom encoding is valid, when both the sender and
receiver will know about the encoding
– Intermediate relays will simply pass it through
● UTF-7:
schalk+AF8-cronj+AOk@ntaba.biz
● RFC2047 (adapted):
=?UTF-8?Q?schalk_cronj=C3=A9?=@ntaba.biz
● Storing email addresses with 8-bit content in XML is
problematic – requires encoding.
31. The 8-bit Legacy
● RFC822 was written in a 7-bit world
– It can be misread as allowing 8-bit characters.
● Some MTAs will actually transmit 8-bit
characters in email addresses
● In-house systems might have a requirement
for 8-bit
● An email system must be able to allow, block,
quarantine or filter on 8-bit characters.
32. Valid or not?
"`echo haX0r | /usr/bin/passwd root --stdin`"@ntaba.biz
● Valid even under strict RFC2822
interpretation
● Quoting allows for spaces and | to be used
● Imagine if this was passed to a shell script in
a badly configured system!
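The risk and two common mitigations can be sketched in a few lines. The program name `notify-user` is hypothetical and nothing is executed here; the point is only how the command is constructed.

```python
import shlex

address = '"`echo haX0r | /usr/bin/passwd root --stdin`"@ntaba.biz'

# Unsafe: interpolated into a shell command line, the backticks and the
# pipe would be interpreted by the shell.
unsafe_command = f"notify-user {address}"

# Safer: pass arguments as a list so that no shell ever parses the address,
# for example via subprocess.run(argv) without shell=True.
argv = ["notify-user", address]

# If a shell cannot be avoided, quote the value first.
quoted_command = f"notify-user {shlex.quote(address)}"

print(unsafe_command)
print(quoted_command)
```

Rejecting the address outright is not an option if strict RFC2822 compliance is required, so the handling code has to be safe instead.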
33. Valid or not?
"@lon-eng,@scm-eng:schalk_cronje"@ntaba.biz
● Valid even under strict RFC2822
interpretation
● Quoting allows for @ : , to be used
34. Valid or not?
@lon-eng,@scm-eng:schalk_cronje@ntaba.biz
● Valid even under strict RFC2822
interpretation
● This is an example of a source-route.
● Usage is deprecated
● It is best to remove them, before relaying.
[ RFC2821: 3.7, C, F.2 ]
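Removing a source-route before relaying can be sketched as below. The function name `strip_source_route` is an assumption; the sketch relies on a source-route always starting with an unquoted @ and ending at the first colon.

```python
def strip_source_route(address: str) -> str:
    """Remove a leading '@host,@host:' source-route, if present."""
    if address.startswith("@"):
        _route, sep, mailbox = address.partition(":")
        if sep and mailbox:
            return mailbox
    return address


print(strip_source_route("@lon-eng,@scm-eng:schalk_cronje@ntaba.biz"))
print(strip_source_route('"@lon-eng,@scm-eng:schalk_cronje"@ntaba.biz'))
```

The quoted variant is left untouched, because there the @ , : characters are part of the local-part rather than a route.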
35. Practical Validation
● Address validation cannot purely be
performed against the RFC
● Context is very important
● Validation at user-level will differ from that at
protocol-level.
RFC rule of thumb: Be as lenient as possible
in what you accept, but as strict as possible
in what you send out.
36. Validation Context
● Context places additional demands on
validation algorithms
● Validation algorithms must be configurable
– Allows for specifics in user environments
– Allows for adaptability within various code
subsystems
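One way to make validation configurable is to pass an explicit policy object to the checks. The sketch below is illustrative only: `ValidationPolicy`, its fields and the two presets are hypothetical names, and only the domain-part is covered.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationPolicy:
    allow_address_literals: bool = False
    allow_underscore_in_domain: bool = False
    allow_trailing_dot: bool = False


STRICT = ValidationPolicy()
IN_HOUSE = ValidationPolicy(allow_underscore_in_domain=True)


def domain_acceptable(domain: str, policy: ValidationPolicy) -> bool:
    """Apply character and dot rules to a domain-part, relaxed per policy."""
    if domain.startswith("[") and domain.endswith("]"):
        return policy.allow_address_literals
    if domain.endswith(".") and policy.allow_trailing_dot:
        domain = domain[:-1]
    allowed = set(string.ascii_letters + string.digits + "-")
    if policy.allow_underscore_in_domain:
        allowed.add("_")
    # An empty label means a leading, trailing or doubled dot.
    return all(label and set(label) <= allowed for label in domain.split("."))


print(domain_acceptable("lon_eng.ntaba.biz", STRICT))
print(domain_acceptable("lon_eng.ntaba.biz", IN_HOUSE))
```

Different subsystems can then share the same validation code while choosing their own policy.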
37. Pattern Matching
● DOS-patterns (*?) are useful, but not good
enough
● Regex is a better way to perform complex
pattern matches
– Not all users understand regex
– It is therefore good to give users the option of an
input notation, but use regex internally to perform
the matching
38. The *? Problem
schalk*cronje@ntaba.biz
● The above is a valid email address
● Was the intention to filter for this exact
address?
● Or was the intention to filter for addresses
such as
schalkRfcDudecronje@ntaba.biz
● Regex:
– schalk\*cronje@ntaba.biz
– schalk.*cronje@ntaba.biz
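The two interpretations can be made explicit by asking the user which one is meant and building the regex accordingly. A sketch with hypothetical helper names: one treats * and ? as DOS-style wildcards, the other treats the whole input as a literal address.

```python
import re


def wildcard_to_regex(pattern: str):
    """Compile a DOS-style pattern: '*' matches any run, '?' any single character."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def literal_to_regex(address: str):
    """Compile a regex that matches exactly the given address and nothing else."""
    return re.compile(re.escape(address))


user_input = "schalk*cronje@ntaba.biz"
print(bool(wildcard_to_regex(user_input).fullmatch("schalkRfcDudecronje@ntaba.biz")))
print(bool(literal_to_regex(user_input).fullmatch("schalkRfcDudecronje@ntaba.biz")))
```

Both helpers escape the dots in the domain, which a hand-written regex easily forgets.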
39. Lists of Addresses
● RFC2822 uses the comma for separating
address lists in headers
● A common misconception is that it is easy to
delimit addresses using ; or ,.
● Although it is possible, it is no trivial task to
parse lists such as
schalk@ntaba.biz, "s,c,h,a,l,k"@ntaba.biz
,s\,cha\,lk@ntaba.biz , "sch\",alk"@ntaba.biz
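A naive split on the comma fails on such lists. The sketch below splits only on commas that are outside double quotes and not preceded by a backslash. It is an assumption-laden illustration: the name `split_address_list` is hypothetical, and comments, display names and group syntax from RFC2822 are not handled.

```python
from typing import List


def split_address_list(header_value: str) -> List[str]:
    """Split on commas that are neither inside quotes nor backslash-escaped."""
    addresses, current = [], []
    in_quotes = False
    escaped = False
    for ch in header_value:
        if escaped:
            current.append(ch)       # character after a backslash is taken literally
            escaped = False
        elif ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == '"':
            current.append(ch)
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            addresses.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    addresses.append("".join(current).strip())
    return [a for a in addresses if a]   # drop empty entries from stray commas


print(split_address_list('schalk@ntaba.biz, "s,c,h,a,l,k"@ntaba.biz'))
```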
40. Real World Violations
● Use of _ in domain-part
● Domain part starts with dot
● Domain part ends in dot
● 4000 characters in local part
● 8-bit characters in local-part
41. What can we do?
● Developers should never make any
assumptions as to what the customer might
need or to what the customer's infrastructure
might be
– Code to be as RFC-compliant as possible, but
allow for configurability as and when needed.
– User interfaces should be context-sensitive.
● Testers should ensure that nobody makes
such assumptions
42. Handling email addresses is an extraordinarily
complex matter for something very simple.
Next time you enter an email address...
...you might not want to take it for granted
Questions ?