NULL values in SQL databases can degrade data quality and integrity by breaking the relational model that SQL relies on. This leads to increased complexity, incomplete results, and reduced confidence in query outcomes. The presence of NULLs should be managed through table design, data validation, and queries that account for unknown values to minimize these impacts downstream. Analyzing NULLs can also provide insight into missing domain knowledge that requires further exploration.
SQL Server - Handle NULL data with queries, design and MDM
1. SQL Server – How to handle NULL data
Duncan Greaves MSc MCSE
Postgraduate Researcher
2. The Problem with NULL
There is a problem with NULL that has persisted since the Relational
Model was proposed in the 1970s.
“The simple scientific fact is that an SQL table that contains a null isn’t a relation; thus,
relational theory doesn’t apply, and all bets are off.” C. J. Date (2014).
The presence of NULLs in a database ‘breaks’ the relational model of
Boolean expressions on which SQL databases rely.
In ‘real world’ applications of data structures, NULLs are often unavoidable.
NULL confuses users, and designers and DBAs dislike it.
Users need to be aware of the design and query compromises they need
to make.
3. Three Valued Logic
The SQL language is based on relational logic.
Adding NULL values to a database breaks the TRUE/FALSE relations implicit in the
model and leads to three-valued logic: TRUE, FALSE and UNKNOWN.
At best this leads to increased complexity through horizontally decomposed
WHERE clauses, workaround syntax and inference.
At worst it leads to incomplete information, returned error codes, interoperability
problems and interpretation problems.
It disrupts Reporting, ETL, Business Intelligence and Data Science initiatives.
It reduces confidence in results. This applies to ALL systems, not just databases.
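The three-valued logic described above can be sketched in a few lines. This is an illustration only: it assumes Python's built-in sqlite3 module rather than SQL Server, because the comparison rules shown are standard SQL, and the helper name scalar is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a one-value query and return that value (None means SQL NULL)."""
    return conn.execute(sql).fetchone()[0]

# A comparison with NULL is neither TRUE nor FALSE: it is UNKNOWN,
# which the driver hands back as None.
null_equals_null = scalar("SELECT NULL = NULL")
one_equals_null = scalar("SELECT 1 = NULL")

# UNKNOWN propagates through AND/OR unless the other operand decides the result.
unknown_and_false = scalar("SELECT (NULL = 1) AND (1 = 0)")  # FALSE decides
unknown_or_true = scalar("SELECT (NULL = 1) OR (1 = 1)")     # TRUE decides
unknown_and_true = scalar("SELECT (NULL = 1) AND (1 = 1)")   # stays UNKNOWN

# NOT IN against a list that contains NULL can never evaluate to TRUE.
not_in_with_null = scalar("SELECT 3 NOT IN (1, 2, NULL)")
```

The last case is a common source of queries that silently return no rows.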
4. Domain Knowledge
SQL databases are modelled as domains, and as such the designer needs to
be able to define what the domain encompasses by defining the boundaries and
identifying components and relationships.
Almost by definition the designer will have incomplete knowledge about the
information that is relevant, especially when implementing new systems.
NULL is stored as a flag, so it is not part of any particular domain or type.
Query complexity is introduced when assumptions have to be made about what a NULL means.
NULL is not a value and it is not zero; it marks something unknown.
5. NULL Data Scenarios
Existence – The attribute does not exist in the domain, or the domain understanding is wrong. E.g. eye
colour for a car.
Missing – The information was not given at the time the row was created. E.g. a customer may
decline to give their age.
Not Yet – The data is contingent upon an unknown event in the future. E.g. termination date or date
of death.
Does not apply – The attribute is not applicable for this instance of a record. E.g. hair colour for bald people,
number of pregnancies for male patients.
Placeholders – Indicates that we know that a piece of data exists, but we don’t know what it is.
Useful for CUBE or ROLLUP queries.
6. Handling NULL in Queries
NULLIF
Syntax: NULLIF (expression, expression)
Returns NULL if both expressions are equal, else returns the first expression.
ISNULL to replace a NULL with a substitute value.
Syntax: ISNULL (check_expression, replacement_value)
Returns check_expression if it is not NULL; otherwise returns replacement_value, which must be implicitly convertible to the data type of check_expression.
COALESCE to use the first non-null expression.
Syntax: COALESCE (exp1, exp2, … expn)
Can use multiple input expressions.
Returns the data type of the expression with the highest precedence.
Can be slower than ISNULL, because it is expanded to a CASE expression and an input may be evaluated more than once.
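A minimal sketch of these functions follows. It assumes Python's built-in sqlite3 module rather than SQL Server: NULLIF and COALESCE behave the same way in both, but SQLite has no two-argument ISNULL, so its closest equivalent, IFNULL, stands in for it here. The helper name scalar is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a one-value query and return that value (None means SQL NULL)."""
    return conn.execute(sql).fetchone()[0]

# NULLIF returns NULL when both arguments are equal, else the first argument.
nullif_equal = scalar("SELECT NULLIF(5, 5)")
nullif_different = scalar("SELECT NULLIF(5, 0)")

# A typical use: turn a sentinel default such as -99 back into NULL.
sentinel_as_null = scalar("SELECT NULLIF(-99, -99)")

# COALESCE returns the first non-NULL argument from a list of any length.
first_known = scalar("SELECT COALESCE(NULL, NULL, 'mobile', 'home')")
all_unknown = scalar("SELECT COALESCE(NULL, NULL)")

# Assumption: IFNULL is SQLite's stand-in for SQL Server's ISNULL(expr, replacement).
replaced = scalar("SELECT IFNULL(NULL, 0)")
kept = scalar("SELECT IFNULL(7, 0)")
```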
7. Handling NULL in WHERE Clauses
WHERE clauses use three-valued logic (TRUE, FALSE, UNKNOWN). UNKNOWN is the logical outcome of a comparison and is
not the same as NULL. Only rows for which the WHERE condition is TRUE are returned.
To test for NULL we have to use the IS NULL and IS NOT NULL operators in the
WHERE clause, not the = operator.
IS NULL
SELECT * FROM Customers WHERE CustName IS NULL
IS NOT NULL
SELECT * FROM Customers WHERE CustName IS NOT NULL
Use Horizontal Decomposition to add other conditionals
SELECT * FROM Customers WHERE CustName IS NULL OR CustName = 'Bob'
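The effect of these predicates can be checked against a small table. The sketch below assumes Python's built-in sqlite3 module in place of SQL Server; the Customers table and CustName column come from the queries above, while the CustId column, the sample rows and the helper ids are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustId INTEGER PRIMARY KEY, CustName TEXT)")
conn.executemany(
    "INSERT INTO Customers (CustId, CustName) VALUES (?, ?)",
    [(1, "Bob"), (2, None), (3, "Alice")],  # None is stored as NULL
)

def ids(where):
    """Return the CustId values of the rows that satisfy a WHERE condition."""
    sql = f"SELECT CustId FROM Customers WHERE {where} ORDER BY CustId"
    return [row[0] for row in conn.execute(sql)]

equals_null = ids("CustName = NULL")      # UNKNOWN for every row
is_null = ids("CustName IS NULL")
is_not_null = ids("CustName IS NOT NULL")

# The row with the NULL name is neither 'Bob' nor not 'Bob', so it drops out.
not_bob = ids("CustName <> 'Bob'")

# Horizontal decomposition: name the NULL case explicitly to keep the row.
not_bob_or_unknown = ids("CustName <> 'Bob' OR CustName IS NULL")
```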
8. Environment and Aggregate Settings
ANSI_NULLS environment setting.
The setting in effect when a stored procedure or user-defined function is created or altered is stored with it.
This option specifies how = and <> comparisons with NULL behave. When it is ON, any comparison
with NULL evaluates to UNKNOWN, so WHERE column = NULL returns no rows. When it is OFF,
column = NULL is TRUE for rows where the column is NULL.
Keep at the default value of ON.
“Null value is eliminated by an aggregate or other SET operation”
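This warning is raised when an aggregate such as SUM, AVG or COUNT(column) skips NULL values.
Arithmetic or other operations on fields that contain NULLs return NULL.
Both may lead to incomplete information being returned.
Use ISNULL or COALESCE to supply a replacement value where that is appropriate.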
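The way aggregates skip NULLs can be sketched as follows. This assumes Python's built-in sqlite3 module, which applies the same aggregate rules but does not emit SQL Server's warning message; the Orders table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, Amount REAL)")
conn.executemany(
    "INSERT INTO Orders (OrderId, Amount) VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 20.0)],  # order 2 has an unknown amount
)

row = conn.execute(
    """
    SELECT COUNT(*),                 -- counts rows
           COUNT(Amount),            -- counts non-NULL amounts only
           SUM(Amount),              -- NULL is skipped
           AVG(Amount),              -- averages the known amounts only
           AVG(COALESCE(Amount, 0))  -- treats unknown as zero
    FROM Orders
    """
).fetchone()
count_rows, count_amounts, total, avg_known, avg_zero_filled = row

# Plain arithmetic does not skip NULL: the whole expression becomes NULL.
null_plus_one = conn.execute("SELECT 1 + NULL").fetchone()[0]
```

The two averages differ, so replacing NULL with 0 is a modelling decision and not just a way to silence the warning.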
9. Table Design Guidelines
Design Integrity into your tables.
Use NOT NULL and CHECK() constraints where possible.
Do not use a column as a primary key if there is ANY possibility that its value
could be NULL.
Avoid nullable columns in FOREIGN KEY relationships.
Consider using de-normalised separate tables to get around this.
Use default field values where appropriate.
Bear in mind the arithmetic consequences of using sentinel defaults such as 0 or -99.
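A minimal sketch of these guidelines, assuming Python's built-in sqlite3 module rather than SQL Server (the constraint keywords are the same, but length and trim are SQLite function names). The column names, the age range and the helper rejected are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Customers (
        CustId   INTEGER PRIMARY KEY,
        CustName TEXT NOT NULL CHECK (length(trim(CustName)) > 0),
        Country  TEXT NOT NULL DEFAULT 'Unknown',
        -- Age is deliberately nullable, but sentinel values such as -99 are refused.
        Age      INTEGER CHECK (Age IS NULL OR Age BETWEEN 0 AND 130)
    )
    """
)

def rejected(sql):
    """Return True if the statement violates a table constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

conn.execute("INSERT INTO Customers (CustId, CustName) VALUES (1, 'Bob')")
default_country = conn.execute(
    "SELECT Country FROM Customers WHERE CustId = 1"
).fetchone()[0]

null_name_rejected = rejected("INSERT INTO Customers (CustId, CustName) VALUES (2, NULL)")
blank_name_rejected = rejected("INSERT INTO Customers (CustId, CustName) VALUES (3, '  ')")
sentinel_age_rejected = rejected(
    "INSERT INTO Customers (CustId, CustName, Age) VALUES (4, 'Ann', -99)"
)
row_count = conn.execute("SELECT COUNT(*) FROM Customers").fetchone()[0]
```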
10. App Design Guidelines
Take steps to avoid NULL values from host programs.
Initialisation of variables
Use defaults and appropriate auto filling of variable values
Deduce values
Track missing data using companion codes
Determine the impact of missing data
Validate data and prevent audit difficulties
Use consistent datatypes and nullability across apps
NULL is not the string “NULL”
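One way a host program might apply these points is sketched below. Everything named here is an assumption made for illustration: the companion codes K, M and N, the clean_field helper and the Patients table. The database is SQLite via Python's built-in sqlite3 module, not SQL Server.

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical companion codes recording WHY a value is present or absent.
KNOWN, MISSING, NOT_APPLICABLE = "K", "M", "N"

def clean_field(raw: Optional[str], applicable: bool = True) -> Tuple[Optional[str], str]:
    """Return (value, companion_code) for one input field."""
    if not applicable:
        return None, NOT_APPLICABLE
    if raw is None or raw.strip() == "":
        return None, MISSING
    return raw.strip(), KNOWN

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Patients ("
    "PatientId INTEGER PRIMARY KEY, Surname TEXT, SurnameCode TEXT NOT NULL)"
)

inputs = [(1, "  Smith "), (2, ""), (3, "NULL")]
for patient_id, raw in inputs:
    value, code = clean_field(raw)
    # Parameter binding sends Python None as SQL NULL
    # and the text "NULL" as an ordinary four-character string.
    conn.execute("INSERT INTO Patients VALUES (?, ?, ?)", (patient_id, value, code))

stored = conn.execute(
    "SELECT PatientId, Surname, SurnameCode FROM Patients ORDER BY PatientId"
).fetchall()
null_ids = [r[0] for r in conn.execute("SELECT PatientId FROM Patients WHERE Surname IS NULL")]
```

Only patient 2 is stored with a NULL surname; the literal text "NULL" for patient 3 is kept as data.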
11. ETL Guidelines
Where multiple fields may contain NULL, consider using a check
code field to indicate which records need attention.
Check the NULL status of each field using ISNULL (Field, 0) and
build a count of the number of fields that fail validation.
Use this as part of the data cleansing process.
Use in ETL and scrubbing tables.
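The per-row count of NULL fields might be built as follows. This sketch assumes Python's built-in sqlite3 module and swaps in a portable CASE WHEN … IS NULL test for the ISNULL (Field, 0) approach mentioned above; the Staging table, its columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staging ("
    "RowId INTEGER PRIMARY KEY, CustName TEXT, Email TEXT, Age INTEGER)"
)
conn.executemany(
    "INSERT INTO Staging VALUES (?, ?, ?, ?)",
    [
        (1, "Bob", "bob@example.com", 41),
        (2, None, "x@example.com", None),
        (3, None, None, None),
    ],
)

# One CASE per field adds 1 to the count when that field is NULL.
rows = conn.execute(
    """
    SELECT RowId,
           CASE WHEN CustName IS NULL THEN 1 ELSE 0 END
         + CASE WHEN Email    IS NULL THEN 1 ELSE 0 END
         + CASE WHEN Age      IS NULL THEN 1 ELSE 0 END AS NullCount
    FROM Staging
    ORDER BY RowId
    """
).fetchall()

# Rows with a non-zero count are the ones the cleansing step should look at.
needs_attention = [row_id for row_id, null_count in rows if null_count > 0]
```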
12. Master Data Management
Management is about catching the information before it enters the database and
cleaning up what is already there.
MDS
Master Data Services (MDS) is included as part of SQL Server.
Allows Models, Entities, Attributes, Rules and Versions to be defined and implemented.
Includes an Excel add-in. Allows power users or analysts to define models and rules.
DQS
Data Quality Services (DQS) helps ensure domain validity and knowledge-driven data quality.
Good for data correction, enrichment, standardization, and de-duplication.
Other third-party applications are available:
Master Data Maestro, etc.
13. Performance Considerations
Time spent in designing appropriate data quality controls will reduce
the cost of maintaining the database, because NULL:
Slows down the working of indexes.
Increases query retrieval times.
Increases search times.
Increases SQL code complexity.
Decreases the confidence in the information gained from the
database.
14. Data Science
NULLs may provide the catalyst for data science work to discover why data is missing,
or what data or domain knowledge we are lacking.
NULL indicates that a value is not known, or that data is missing or incomplete.
May point to missing entities or uncaptured events.
May skew the results of data tools that disregard NULL values.
Known knowns, known unknowns and unknown unknowns. Data discovery and
knowledge begin by examining what it is that is unexplained.
15. Summary
NULL values have always been present in SQL databases, but they degrade the
quality of the information that can be obtained from the data source.
NULLs can have an adverse effect on downstream systems, in particular Reporting, BI,
Predictive Analytics or Machine Learning that rely on the integrity of the data.
Reduce the impact on your information by:
Managing the quality of data going in
Designing tables with integrity constraints
Designing apps to validate the input
Designing queries to ensure correct results are returned
Using NULLs as clues to pick up where domain knowledge is lacking.