Copyright © 2019 Oracle and/or its affiliates. All rights
Regular Expressions with full Unicode support
Martin Hansson
Software Development
MySQL Optimizer Team
The ins and outs of the new regular expression functions and the ICU library
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
What Happened?
Old regexp library (Henry Spencer)
• Does not support Unicode
• Limited Features
• No resource control
• Only Boolean Search
https://mysqlserverteam.com/new-regular-expression-functions-in-mysql-8-0/
Not some niche feature
Feature Requests for Extracting Substring:
Bug#79428 No way to extract a substring matching a regex
Bug#29781 Adding in Pattern Replace (RegExp) for MySQL Engine
Bug#16357 add in functions to do regular expression replacements in a select query
Bug#9105 Regular expression support for Search & Replace
51 “affects me” total
CTE had 59 “affects me”
New Regular Expression Functions
REGEXP_INSTR
REGEXP_LIKE
REGEXP_REPLACE
REGEXP_SUBSTR
Program Agenda
Security
ICU library
Unicode
Working with Unicode in Regular Expressions
Two Security Concerns
• Memory
• Runtime
Security
Cap on runtime
mysql> SELECT regexp_instr(
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC',
'(A+)+B');
ERROR 3699 (HY000): Timeout exceeded in regular expression match.
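To see why a cap on runtime matters for a pattern like (A+)+B, here is a minimal Python sketch. It is not MySQL or ICU code, and the function name is hypothetical. It counts the ways the nested quantifier can split a run of A's into groups; a backtracking matcher may have to try every one of them before it can report that the trailing B is missing.

```python
def count_group_splits(run_length: int) -> int:
    """Count the ways to split `run_length` characters into non-empty groups."""
    ways = [0] * (run_length + 1)
    ways[0] = 1  # the empty run can be split in exactly one way
    for length in range(1, run_length + 1):
        # Pick the size of the first group, then split whatever remains.
        ways[length] = sum(ways[length - first] for first in range(1, length + 1))
    return ways[run_length]


if __name__ == "__main__":
    for n in (5, 10, 20, 30):
        print(n, count_group_splits(n))
```

The count doubles with every additional character, so a subject only a few dozen characters long can already take an impractical number of steps without a limit.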
Security
Cap on Memory
mysql> SELECT regexp_instr(
'', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
ERROR 3699 (HY000): Timeout exceeded in regular expression match.
mysql> SET GLOBAL regexp_stack_limit = 239;
mysql> SELECT regexp_instr(
'', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
ERROR 3698 (HY000): Overflow in the regular expression backtrack stack.
Program Agenda
Security
ICU library
Unicode
Working with Unicode in Regular Expressions
ICU library
Building ICU
Need three libraries
• i18n library
– Regular expressions
– Character sets
• Common library
• Data Library
Program Agenda
Security
ICU library
Unicode
Working with Unicode in Regular Expressions
UTF-32
ab🍺d
0x00000061 0x00000062 0x0001f37a 0x00000064
UTF-8
ab🍺d
0x61 0x62 0xF09F8DBA 0x64
UTF-16
ab🍺d
0x0061 0x0062 0x3CD87ADF 0x0064
(0x3CD87ADF is the surrogate pair 0xD83C 0xDF7A, shown in little-endian byte order)
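The three encodings can be reproduced with a short Python sketch. The sample string is an assumption based on the code point 0x0001f37a shown on the UTF-32 slide (U+1F37A, BEER MUG); the variable names are illustrative only.

```python
# "a", "b", U+1F37A (BEER MUG), "d"
text = "ab\U0001F37Ad"

utf32_be = text.encode("utf-32-be").hex()   # four bytes per code point
utf8 = text.encode("utf-8").hex()           # one to four bytes per code point
utf16_be = text.encode("utf-16-be").hex()   # two bytes, or a surrogate pair
utf16_le = text.encode("utf-16-le").hex()   # same code units, bytes swapped

if __name__ == "__main__":
    print("UTF-32BE:", utf32_be)
    print("UTF-8:   ", utf8)
    print("UTF-16BE:", utf16_be)
    print("UTF-16LE:", utf16_le)
```

Only the character outside the Basic Multilingual Plane changes size between encodings: four bytes in UTF-8 and two 16-bit code units in UTF-16.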
Under the Hood
• Count codepoints
• Convert to UTF-16
• Use the C API
• Convert back if needed
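One reason code points have to be counted is that a position reported in UTF-16 code units differs from a character position as soon as the text contains a surrogate pair. The following Python sketch illustrates that bookkeeping under this assumption; it is not MySQL's implementation, and both function names are hypothetical.

```python
def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for one code point."""
    return 2 if ord(ch) > 0xFFFF else 1


def codepoint_index_from_utf16_offset(text: str, offset: int) -> int:
    """Translate a UTF-16 code unit offset into a code point index."""
    units = 0
    for index, ch in enumerate(text):
        if units == offset:
            return index
        units += utf16_units(ch)
    if units == offset:
        return len(text)  # offset points just past the last character
    raise ValueError("offset is inside a surrogate pair or past the end")


sample = "ab\U0001F37Ad"  # the beer mug occupies two UTF-16 code units

if __name__ == "__main__":
    # "d" sits at UTF-16 offset 4 but is the fourth code point (index 3).
    print(codepoint_index_from_utf16_offset(sample, 4))
```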
Program Agenda
Security
ICU library
Unicode
Working with Unicode in Regular Expressions
Case folding
Simple case sensitivity
mysql> SELECT regexp_like( 'a', '(?i)A' ); # mode modifier
1
mysql> SELECT regexp_like( 'a', 'A', 'i' ); # match_parameter
1
mysql> SELECT regexp_like(
'a' COLLATE utf8mb4_0900_as_cs, 'A' ); # collation
0
Case folding
Simple case sensitivity
mysql> SELECT regexp_like( 'Abc', 'abC', 'c' );
→ 0
mysql> SELECT regexp_like( 'Abc', 'abC', 'i' );
→ 1
Case folding
Case-mapping process
A → a
B → b
C → c
Case folding
Full Case Folding
ß → ss
mysql> SELECT regexp_like( 'ß', '^ss$', 'c' );
→ 0
mysql> SELECT regexp_like( 'ß', '^ss$', 'i' );
→ 1
Case folding
Full Case Folding
ᾛ ⇒ ἣι
U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
U+1F23 GREEK SMALL LETTER ETA WITH DASIA AND VARIA + U+03B9 GREEK SMALL LETTER IOTA
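Full case folding can be observed outside MySQL as well. The sketch below uses Python's `str.casefold()`, which applies the Unicode full case folding mappings, as a stand-in; it says nothing about how ICU's matcher is implemented, and the variable names are illustrative.

```python
sharp_s = "\u00df"   # ß, LATIN SMALL LETTER SHARP S
eta = "\u1f9b"       # ᾛ, GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI

simple = "ABC".casefold()          # one-to-one mapping, as in A → a, B → b, C → c
folded_sharp_s = sharp_s.casefold()  # expands to two characters
folded_eta = eta.casefold()          # expands to U+1F23 followed by U+03B9

if __name__ == "__main__":
    print(simple, folded_sharp_s, folded_eta)
    print([f"U+{ord(ch):04X}" for ch in folded_eta])
```

A folded string can be longer than the original, which is what separates full case folding from simple lowercase conversion.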
Case folding
Has to Look Like a String in order to Match
mysql> SELECT regexp_like( 'ß', '^ss$' );
→ 1
mysql> SELECT regexp_like( 'ß', '^s+$' );
→ 0
mysql> SELECT regexp_like( 'ß', '^s{2}$' );
→ 0
Case folding
Can’t start Match Within Expanded Character
mysql> SELECT regexp_like( 'ß', 's$' );
→ 0
mysql> SELECT regexp_like( 'ß', '^s' );
→ 0
Case folding
Collations
mysql> select 'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss'\G
*************************** 1. row ***************************
'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss': 1
mysql> select 'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss'\G
*************************** 1. row ***************************
'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss': 0
Case folding
Language Dependent Case Folding
mysql> SELECT regexp_like( 'I', 'i' );
→ 1
mysql> SELECT regexp_like( 'İ', 'i' );
→ 0
mysql> SELECT regexp_like( 'I', 'ı' );
→ 0
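The three results follow from the language-independent default folding, which ignores the Turkish rules for dotted and dotless i. The Python sketch below uses `str.casefold()` as a stand-in for that default folding; it is not ICU itself, and the names are illustrative.

```python
capital_i = "I"
dotted_capital_i = "\u0130"   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small_i = "\u0131"    # ı, LATIN SMALL LETTER DOTLESS I

folded = {
    "I": capital_i.casefold(),
    "İ": dotted_capital_i.casefold(),
    "ı": dotless_small_i.casefold(),
}

if __name__ == "__main__":
    for source, result in folded.items():
        print(source, [f"U+{ord(ch):04X}" for ch in result])
```

Under the default folding, I folds to i, İ folds to i followed by U+0307 COMBINING DOT ABOVE, and ı is left unchanged, so neither İ nor ı folds to a plain i.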
Beware of Conversion!
mysql> set names latin1;
mysql> create table t1 ( a char ( 10 ) );
mysql> insert into t1 values ( 'å' );
mysql> select a from t1\G
*************************** 1. row ***************************
a: å
mysql> select regexp_like( a, 'å' ) from t1\G
*************************** 1. row ***************************
regexp_like( a, 'å' ): 1
Beware of Conversion!
Use Hex Codes!
mysql> select hex( a ) from t1;
+----------+
| hex( a ) |
+----------+
| C383C2A5 |
+----------+
å is encoded as:
Latin-1: 0xe5
UTF-8: 0xc3a5
Conversion flow
Terminal (UTF-8): å = c3a5
Server, incoming: Latin-1 → UTF-8
Table (UTF-8): C383C2A5
Server, outgoing: UTF-8 → Latin-1
Terminal (UTF-8): c3a5 = å
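The round trip can be replayed byte by byte in Python. This is a sketch of the conversions named in the flow, assuming a UTF-8 terminal talking to a connection declared as latin1; the variable names are illustrative.

```python
original = "\u00e5"  # å

# The UTF-8 terminal sends two bytes for å.
terminal_bytes = original.encode("utf-8")

# The server was told the client speaks Latin-1, so it reads two characters.
misread = terminal_bytes.decode("latin-1")

# Converting those two characters to UTF-8 for the table doubles the encoding.
stored = misread.encode("utf-8")

# On the way out the server converts UTF-8 back to Latin-1 ...
returned = stored.decode("utf-8").encode("latin-1")

# ... and the UTF-8 terminal renders the resulting bytes as å again.
shown = returned.decode("utf-8")

if __name__ == "__main__":
    print(terminal_bytes.hex(), stored.hex().upper(), shown)
```

The value looks correct on screen because the two wrong conversions cancel out, while the stored bytes are double-encoded. That is why checking with hex() is the reliable test.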
Power Tip
Use Hex Codes and Character Set Introducers!
mysql> set global character_set_client = utf8mb4;
mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
+-----------------+--------------+
| _utf8mb4 0xc3a5 | _latin1 0xe5 |
+-----------------+--------------+
| å | å |
+-----------------+--------------+
mysql> set global character_set_client = latin1;
mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
+-----------------+--------------+
| _utf8mb4 0xc3a5 | _latin1 0xe5 |
+-----------------+--------------+
| å | å |
+-----------------+--------------+
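A character set introducer pairs raw bytes with an explicit encoding, so the result no longer depends on what the connection assumes. A rough Python analogue, offered only as an illustration of the idea rather than of MySQL's behaviour, is decoding a hex literal with a named codec:

```python
# Same character, two different byte sequences, each decoded with its own codec.
from_utf8 = bytes.fromhex("c3a5").decode("utf-8")
from_latin1 = bytes.fromhex("e5").decode("latin-1")

if __name__ == "__main__":
    print(from_utf8, from_latin1, from_utf8 == from_latin1)
```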
Questions?
Regular Expressions with full Unicode support

  • 1. Copyright © 2019 Oracle and/or its affiliates. All rights
  • 2. Copyright © 2019 Oracle and/or its affiliates. All rights Regular Expressions with full Unicode support Martin Hansson Software Development MySQL Optimizer Team The ins and outs of the new regular expression functions and the ICU library
  • 3. Copyright © 2019 Oracle and/or its affiliates. All rights Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  • 4. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | What Happened? Old regexp library (Henry Spencer) • Does not support Unicode • Limited Features • No resource control • Only Boolean Search https://mysqlserverteam.com/new-regular-expression-functions-in-mysql-8-0/
  • 5. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Not some niche feature Feature Requests for Extracting Substring: Bug#79428 No way to extract a substring matching a regex Bug#29781 Adding in Pattern Replace (RegExp) for MySQL Engine Bug#16357 add in functions to do regular expression replacements in a select query Bug#9105 Regular expression support for Search & Replace 51 “affects me” total CTE had 59 “affects me”
  • 6. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | New Regular Expression Functions REGEXP_INSTR REGEXP_LIKE REGEXP_REPLACE REGEXP_SUBSTR
  • 7. Copyright © 2019 Oracle and/or its affiliates. All rights Program Agenda Security ICU library Unicode Working with Unicode in Regular Expressions
  • 8. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Two Security Concerns Memory Runtime 8
  • 9. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Security Cap on runtime mysql> SELECT regexp_instr( 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC', '(A+)+B'); ERROR 3699 (HY000): Timeout exceeded in regular expression match.
  • 10. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Security Cap on Memory mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' ); ERROR 3699 (HY000): Timeout exceeded in regular expression match. mysql> SET GLOBAL regexp_stack_limit = 239; mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' ); ERROR 3698 (HY000): Overflow in the regular expression backtrack stack.
  • 11. Copyright © 2019 Oracle and/or its affiliates. All rights Program Agenda Security ICU library Unicode Working with Unicode in Regular Expressions
  • 12. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | ICU library
  • 13. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Building ICU Need three libraries • i18n library – Regular expressions – Character sets • Common library • Data Library
  • 14. Copyright © 2019 Oracle and/or its affiliates. All rights Program Agenda Security ICU library Unicode Working with Unicode in Regular Expressions
  • 15. 15 UTF-32 ab d 0x00000061 0x000000610x00000061 0x000000610x00000062 0x000000640x000000610x000000610x000000610x0001f37a
  • 18. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Under the Hood • Count codepoints • Convert to UTF-16 • Use the C API • Convert back if needed
  • 19. Copyright © 2019 Oracle and/or its affiliates. All rights Program Agenda Security ICU library Unicode Working with Unicode in Regular Expressions
  • 20. Case folding: Simple case sensitivity
      mysql> SELECT regexp_like( 'a', '(?i)A' ); # mode modifier → 1
      mysql> SELECT regexp_like( 'a', 'A', 'i' ); # match_parameter → 1
      mysql> SELECT regexp_like( 'a' COLLATE utf8mb4_0900_as_cs, 'A' ); # collation → 0
  • 21. Case folding: Simple case sensitivity
      mysql> SELECT regexp_like( 'Abc', 'abC', 'c' ); → 0
      mysql> SELECT regexp_like( 'Abc', 'abC', 'i' ); → 1
  • 22. Case folding: Case-mapping process
      A → a
      B → b
      C → c
  • 23. Case folding: Full Case Folding
      ß → ss
      mysql> SELECT regexp_like( 'ß', '^ss$', 'c' ); → 0
      mysql> SELECT regexp_like( 'ß', '^ss$', 'i' ); → 1
  • 24. Case folding: Full Case Folding
      ᾛ ⇒ ἣι
      ᾛ: U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
      ἣι: U+1F23 GREEK SMALL LETTER ETA WITH DASIA AND VARIA, U+03B9 GREEK SMALL LETTER IOTA
  • 25. Case folding: Has to Look Like a String in order to Match
      mysql> SELECT regexp_like( 'ß', '^ss$' ); → 1
      mysql> SELECT regexp_like( 'ß', '^s+$' ); → 0
      mysql> SELECT regexp_like( 'ß', '^s{2}$' ); → 0
  • 26. Case folding: Can’t start Match Within Expanded Character
      mysql> SELECT regexp_like( 'ß', 's$' ); → 0
      mysql> SELECT regexp_like( 'ß', '^s' ); → 0
  • 27. Case folding: Collations
      mysql> select 'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss'\G
      *************************** 1. row
      'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss': 1
      mysql> select 'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss'\G
      *************************** 1. row
      'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss': 0
  • 28. Case folding: Language Dependent Case Folding
      mysql> SELECT regexp_like( 'I', 'i' ); → 1
      mysql> SELECT regexp_like( 'İ', 'i' ); → 0
      mysql> SELECT regexp_like( 'I', 'ı' ); → 0
  • 29. Beware of Conversion!
      mysql> set names latin1;
      mysql> create table t1 ( a char ( 10 ) );
      mysql> insert into t1 values ( 'å' );
      mysql> select a from t1\G
      *************************** 1. row
      a: å
      mysql> select regexp_like( a, 'å' ) from t1\G
      *************************** 1. row
      regexp_like( a, 'å' ): 1
  • 30. Beware of Conversion! Use Hex Codes!
      mysql> select hex( a ) from t1;
      +----------+
      | hex( a ) |
      +----------+
      | C383C2A5 |
      +----------+
  • 31. Beware of Conversion! Use Hex Codes!
      mysql> select hex( a ) from t1;
      +----------+
      | hex( a ) |
      +----------+
      | C383C2A5 |
      +----------+
      å is encoded as: Latin-1: 0xe5; UTF-8: 0xc3a5
  • 32. Conversion flow (diagram): Terminal (UTF-8) sends c3a5 (å) → Server converts Latin-1 → UTF-8 → Table (UTF-8) stores C383C2A5 = Ã¥ → Server converts UTF-8 → Latin-1 → Terminal
  • 33. Power Tip: Use Hex Codes and Character set Introducers!
      mysql> set global character_set_client = utf8mb4;
      mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
      +-----------------+--------------+
      | _utf8mb4 0xc3a5 | _latin1 0xe5 |
      +-----------------+--------------+
      | å               | å            |
      +-----------------+--------------+
      mysql> set global character_set_client = latin1;
      mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
      +-----------------+--------------+
      | _utf8mb4 0xc3a5 | _latin1 0xe5 |
      +-----------------+--------------+
      | å               | å            |
      +-----------------+--------------+
  • 34. Questions?
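
The case folding slides (20 to 28) can be tried outside MySQL. What follows is a minimal sketch in Python, assuming that Python's `str.casefold()`, which applies Unicode full case folding, is a fair stand-in for the default folding ICU uses. It is not the code MySQL runs, and the helper name `folds_equal` is made up for illustration.

```python
# Illustration only: str.casefold() applies Unicode full case folding.
# It stands in for ICU's default folding; it is not MySQL's code path.

def folds_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after full case folding."""
    return a.casefold() == b.casefold()

# Simple case folding: each character maps to exactly one character.
simple = folds_equal("Abc", "abC")

# Full case folding: one character may expand to several.
eszett = "\u00df".casefold()   # ß, LATIN SMALL LETTER SHARP S
eta = "\u1f9b".casefold()      # ᾛ, expands to two code points

# Default (non-Turkic) folding: capital I does not fold to dotless ı.
dotless = folds_equal("I", "\u0131")

print(simple, eszett, [f"U+{ord(c):04X}" for c in eta], dotless)
```

Comparing whole folded strings like this mirrors the "has to look like a string" rule on slide 25: the expansion of ß only matters when the other side is the literal sequence it folds to.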

Editor's notes

  1. I have worked with MySQL since time immemorial, since MySQL AB. I work from Uppsala, Sweden, the former head office of MySQL. Swedish is my native tongue. That makes a difference, as you will see.
  2. I have worked with MySQL since time immemorial, since MySQL AB. I work from Uppsala, Sweden, the former head office of MySQL. Swedish is my native tongue. That makes a difference, as you will see.
  3. So what’s all this about? We switched our regex library in 8.0.4. At the time I blogged about it here. The old one was written by Henry Spencer in 1986. It is simply called regexp, and it is a very good regex library. It has been used widely in the Unix realm and is part of the POSIX standard. Also called “the book regexp library” because he updated it for the book Software Solutions in C in 1994. It made its way to Tcl, Postgres and even early Perl. Apparently Postgres still uses it. Really good. Great performance. But ASCII only. Worked byte-by-byte. Lacks many features. Not safe: easy to put in an infinite loop. You can only do a boolean search, not extract the match; it doesn’t have a pattern buffer out of the box. Hence it doesn’t support search-replace.
  4. And this was a quite popular request. Four FR bugs against getting the matched substring alone. We had 51 “Affects me” in total. CTE had 59, but that’s a really popular feature.
  5. Now we have four functions. Instr → position, before or after. Like → boolean. Replace → replaces a match, capture. Substr → the matched substring.
  6. So here’s the agenda. On top, we have security, which is why we chose ICU. Perhaps not the obvious choice given the candidates. It also has close ties to Unicode. What I won’t cover here is all the features of regular expressions. These are documented in our manual, and you can always head to the ICU documentation. My ambition is to teach you how to work efficiently and securely with Unicode and to give some insight into where common wisdom breaks down. I presented here 3 years ago and I had a really good time, so I wanted to go again. I asked my boss what I was going to talk about; I haven’t really added anything new since last time. All I could think of was the new regular expressions. “Tell ’em about that,” he said. “They’ll love that.” So, I submitted this talk as a 20 minute presentation. Not only did it get accepted but it got upgraded to 30 minutes. I couldn’t think of much to say, so I asked around. “What do YOU want to know about regular expressions with Unicode?” Nobody had a clue. So that’s why I just picked some common pitfalls that I consider tricky.
  7. The way a malicious user can exploit regex matching is by exhausting the memory or by creating an infinite loop, consuming all the CPU time.
  8. Out of the box there’s always a cap on runtime. Runtime is specified in “steps” of the match engine. A bit vacuous. Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but will typically be on the order of milliseconds. Match the first A, capture, then repeat that match. Backtrack, match 2nd, repeat that and so on. Eventually fail because of the C. The limit (regexp_time_limit) is set conservatively to 32 (secure by default).
  9. Here I’m trying to run out of memory. Really have to provoke here. You reach the time limit first. Match the empty string 120 times, repeat that 11 times, repeat that 11 times, etc. The limit is on the backtracking stack used by the engine, in bytes. Here I’m choking it to 239 bytes. The default size is 8,000,000 bytes (about 8 MB). Never managed to DoS the server.
  10. So… about the ICU library.
  11. What is the ICU library? A set of i18n libs. What they provide is globalization support and Unicode for software applications. They have an open source license. From what I gather compatible with GNU, but IANAL. Used by Java, Apple, Amazon, IBM… The Unicode Consortium is mostly known for emoji nowadays. New releases of Unicode typically contain new emojis. And so you have to be able to search for them. Haha-papa a.k.a. Sushi-beer bug. And so regexps have to support them. 💬 5 billion emojis are sent daily on Facebook Messenger. 📸 By mid-2015, half of all comments on Instagram included an emoji. 🍑 Only 7% of people use the peach emoji as a fruit. The rest mostly use it as a butt or for other non-fruit uses, according to Emojipedia. In a sense ICU is Unicode. Support for all of Unicode.
  12. We ship ICU with MySQL, and optionally build bundled. We ship 59.1. I notice Ubuntu 18.04 ships 60. There’s the internationalization library which contains regexp and charsets. All we use right now. All we bundle. The common library contains things like the breakiterator which helps work with grapheme clusters. I won’t go into grapheme clusters in this presentation. We don’t handle those yet. The data library is not used currently. Don’t ship. Fairly big, not needed for regexp.
  13. Tell you a bit about Unicode. It specifies three encodings.
  14. UTF-32: + constant size; + maps 1-to-1 to Unicode code points; - space consuming.
  15. UTF-8: + optimized for Western ASCII; + small (for Western); + self-synchronizing (what isn’t???); - variable size. De-facto standard for the web: 92.9%.
  16. UTF-16: generally regarded as the worst of both worlds. - Bigger than UTF-8; - not fixed like UTF-32; + more is constant (what? Which planes?); + also self-synchronizing. Surrogate pairs. Broken in Java. How? Alas, used by ICU.
  17. So the way we use ICU is: unless you start on the first character, we count the code points before the start position. Convert the rest to UTF-16, search with ICU. We use ICU’s C API. There is also a C++ API.
  18. So, I have two examples how to work with Unicode.
  19. You can specify case sensitivity in three ways. Mode modifiers inside the regexp have the highest priority. If there are no mode modifiers, match_parameter is used. It is a string of modifiers: ‘c’ means case-sensitive, ‘i’ means case-insensitive. If there are none of those, we look at the collation. There are rules for computing which collation should be used in any comparison. They apply here.
  20. Case insensitivity seems simple at first. Text is normalized by transforming to the same case. Then compare. On the next slide we see how such a case mapping could look.
  21. Totally obvious, right? One character maps to exactly one character. This is called simple case insensitive matching. Well there are some trickier cases.
  22. The German Eszett (ß) is generally understood to be equivalent to two s’es. So in full case insensitive matching they should be equal. Since there is no Eszett in any other language, this folding is part of the default. I could go on all day about case mapping, it’s a 61-page document in the Unicode standard. But these are the essentials.
  23. This example is a little more complicated for me. Here one letter obviously maps to two letters. Actually letters. Not just code points. If you paste them and press backspace, the little ι goes away. In this case they’re different. Works the same way. It’s all Greek to me.
  24. Full case folding is used when the pattern contains anything that looks like a character string, even just one char.
  25. A match can never start within an expanded character. The anchors here enforce a match that would 1) start in the middle 2) end in the middle
  26. This is consistent with how collations work with the equals predicate. The collation name is hard to read: charset, language code, pb (don’t know), accent sensitive, case sensitive.
  27. Case folding can also be language dependent. In the default case folding, capital I folds to small i with dot. However, in the Turkish case folding, a dotted capital İ is case folded to dotted lowercase i. Dotless capital I folds to dotless lowercase ı. So in a Turkish locale, the default is actually wrong.
  28. Another problem with full Unicode and regexp. You need to be careful when you send non-ASCII data from a client. Here is a cautionary tale. Here I changed the variables character_set_connection, character_set_client and character_set_results. That is what SET NAMES does. So, I create a table. I populate it. Swedish letter å. Pronounced Read back. Check with a regular expression match. So, everything is fine, right? Let’s do a “trust but verify” here. I want to see what’s actually in the table. The problem is that it will always be converted to my character set. I want to apply a function to it on the server side. Problem is, all functions will also convert their arguments. What to do? All functions save one: the hex() function. It will tell the truth.
  29. So here we have… what? Is this really an a with ring? Let’s check.
  30. This is not å in any encoding. What is going on?
  31. My terminal is UTF-8. So, when I type å on my Swedish keyboard, it sends c3a5 to the server. Now, when I set character_set_client, what I really said was “interpret as latin1”. Fine, c3a5, that’s A-wave (Ã), yen (¥). Stores that. But the table stores utf8, so let’s convert. And that becomes C383C2A5. Now, when I do select, it reads character_set_results: oh yeah, you speak latin-1. Let me translate for ya. And so we’re back full circle. Especially tricky with latin-1 since anything is valid latin-1. No check fails.
  32. So here’s a power tip for troubleshooting your multilingual regexps. If you use hex codes and character set introducers, it’s totally unambiguous. As you see here.
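
The round trip in the cautionary tale can be replayed byte by byte. This is a minimal sketch in Python, assuming a UTF-8 terminal and a UTF-8 table as described in the notes. Python's codecs only stand in for the server's character set conversion, and the variable names are illustrative.

```python
# Illustration only: Python codecs stand in for MySQL's conversions.

typed = "\u00e5"                     # å, as typed on the keyboard
wire = typed.encode("utf-8")         # the UTF-8 terminal sends c3 a5

# character_set_client = latin1: the server reads each byte as one
# Latin-1 character, so c3 a5 becomes the two characters Ã and ¥.
misread = wire.decode("latin-1")

# The table stores UTF-8, so those two characters are re-encoded.
stored = misread.encode("utf-8")
print(stored.hex().upper())          # the bytes that hex(a) would report

# character_set_results = latin1: converting back gives c3 a5 again,
# which the UTF-8 terminal renders as å, hiding the damage.
returned = stored.decode("utf-8").encode("latin-1")
print(returned == wire)
```

Because every byte value is a valid Latin-1 character, neither conversion step can fail, which is why only the stored bytes reveal the double encoding.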