FS#61605 - AUR web: Comments with Unicode characters are silently discarded
Attached to Project:
AUR web interface
Opened by Alberto Salvia Novella (es20490446e) - Friday, 01 February 2019, 23:31 GMT
Last edited by Lukas Fleischer (lfleischer) - Tuesday, 21 April 2020, 16:07 GMT
Details
HOW TO REPRODUCE:
- On an AUR package page, add a comment containing a Unicode pictograph (https://getemoji.com/)
RESULT:
- The comment is silently discarded.
https://youtu.be/M0UlMpA-7pY
On the topic of this bug report: if the report is correct, something about the production AUR must differ from my setup, and the only difference that makes sense is that I'm using sqlite while the server is using mariadb. As far as I know mariadb should support Unicode just fine, but after digging around, the settings look a bit odd:
>>> import aurweb.db
>>> from pprint import pprint
>>> conn = aurweb.db.Connection()
>>> cur = conn.execute(r"SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%'")
>>> pprint(cur.fetchall())
[('character_set_client', 'utf8mb4'),
 ('character_set_connection', 'utf8mb4'),
 ('character_set_database', 'utf8'),
 ('character_set_filesystem', 'binary'),
 ('character_set_results', 'utf8mb4'),
 ('character_set_server', 'utf8mb4'),
 ('character_set_system', 'utf8'),
 ('collation_connection', 'utf8mb4_general_ci'),
 ('collation_database', 'utf8_general_ci'),
 ('collation_server', 'utf8mb4_general_ci')]
I will punt to lfleischer on this. IIRC mysql is weird about utf8: its utf8 charset stores at most 3 bytes per character, so it isn't really UTF-8 unless you use the utf8mb4 version. So it sounds like in order to support annoying people who use unicode emoji to communicate serious messages, we might need to change some of these settings (presumably character_set_database and collation_database) from utf8 to utf8mb4? This would be a database-level problem...
I know unicode currently works for most users, at least to the extent that, say, Chinese can be correctly inserted. But common Chinese characters take 3 bytes in UTF-8, whereas emoji take 4 bytes...
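To illustrate the 3-byte versus 4-byte distinction, here is a minimal Python sketch. The helper names `utf8_len` and `fits_in_3_byte_utf8` are made up for this example and are not part of aurweb. It only shows encoded lengths and makes no claim about how the production database actually handles the insert.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode a single character as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_3_byte_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8,
    i.e. its code point is in the Basic Multilingual Plane (<= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# Hypothetical sample characters, chosen only to cover 1 to 4 bytes.
samples = {
    "ascii": "a",
    "latin": "é",
    "chinese": "中",
    "emoji": "😀",
}

for name, ch in samples.items():
    print(f"{name:8} U+{ord(ch):04X} {utf8_len(ch)} byte(s)")

print(fits_in_3_byte_utf8("中文 comment"))  # True
print(fits_in_3_byte_utf8("nice 👍"))       # False
```

A check like `fits_in_3_byte_utf8` could in principle be used to tell which comments a 3-byte utf8 column would be able to hold, though whether the server rejects, truncates, or drops the rest depends on its configuration.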