I guess this will be the first entry of what I’m going to call my blog. Or a weblog, if you prefer.
I have been wanting to start a blog for a long time. I just never had a place (nor time, which I currently actually don’t have, but shhh) to keep it. I guess this is it.
Anyway, what pushed me over the edge was this: I was browsing StackExchange while I was talking to someone about my vim
configuration. I had :set list
on while sending the screenshot, and well. They asked about my “underscores”. I use ␣
(U+2423) for visual spaces.
I tripped over an answer which I had to comment on. They called the hyphens in the url for DASHES. Who… in… their… right… mind!? Hyphens are NOT dashes!
Apparently, there are a bunch of well defined dashes.
xpath -q -e "//char[@Dash='Y']/@*[name()='cp' or name()='na']" ucd.all.flat.xml | sed -E 's/"|^ |na=//g' | sed 's/cp=/U+/g' | paste -d ':\t' - /dev/null -
U+002D: HYPHEN-MINUS
U+058A: ARMENIAN HYPHEN
U+05BE: HEBREW PUNCTUATION MAQAF
U+1400: CANADIAN SYLLABICS HYPHEN
U+1806: MONGOLIAN TODO SOFT HYPHEN
U+2010: HYPHEN
U+2011: NON-BREAKING HYPHEN
U+2012: FIGURE DASH
U+2013: EN DASH
U+2014: EM DASH
U+2015: HORIZONTAL BAR
U+2053: SWUNG DASH
U+207B: SUPERSCRIPT MINUS
U+208B: SUBSCRIPT MINUS
U+2212: MINUS SIGN
U+2E17: DOUBLE OBLIQUE HYPHEN
U+2E1A: HYPHEN WITH DIAERESIS
U+301C: WAVE DASH
U+3030: WAVY DASH
U+30A0: KATAKANA-HIRAGANA DOUBLE HYPHEN
U+FE31: PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32: PRESENTATION FORM FOR VERTICAL EN DASH
U+FE58: SMALL EM DASH
U+FE63: SMALL HYPHEN-MINUS
U+FF0D: FULLWIDTH HYPHEN-MINUS
I… am… disgusted. Hyphens are dephined¹ as dashes in Unicode.
Nevertheless, I decided to look at the differences between dashes and hyphens according to the Unicode Consortium. There are no dashes in the Unihan dataset, unsurprisingly, but there are 25 of them in the standard set. Same applies for the following on hyphens, except we have 28 in total.
xpath -q -e "//char[@Hyphen='Y' or @Dash='Y']/@*[name()='cp' or name()='na' or name()='Hyphen' or name()='Dash']" ucd.all.flat.xml | sed -E 's/"|^ |na=//g' | sed 's/cp=/U+/g' | sed -z 's/\n/:/g' | sed 's/U+/\nU+/g' | sed -E 's/Dash=N|Hyphen=N//g' | column -t -s:
Call me out on excessively long Bash commands. You’ll get used to it.
U+002D HYPHEN-MINUS Dash=Y Hyphen=Y
U+00AD SOFT HYPHEN Hyphen=Y
U+058A ARMENIAN HYPHEN Dash=Y Hyphen=Y
U+05BE HEBREW PUNCTUATION MAQAF Dash=Y
U+1400 CANADIAN SYLLABICS HYPHEN Dash=Y
U+1806 MONGOLIAN TODO SOFT HYPHEN Dash=Y Hyphen=Y
U+2010 HYPHEN Dash=Y Hyphen=Y
U+2011 NON-BREAKING HYPHEN Dash=Y Hyphen=Y
U+2012 FIGURE DASH Dash=Y
U+2013 EN DASH Dash=Y
U+2014 EM DASH Dash=Y
U+2015 HORIZONTAL BAR Dash=Y
U+2053 SWUNG DASH Dash=Y
U+207B SUPERSCRIPT MINUS Dash=Y
U+208B SUBSCRIPT MINUS Dash=Y
U+2212 MINUS SIGN Dash=Y
U+2E17 DOUBLE OBLIQUE HYPHEN Dash=Y Hyphen=Y
U+2E1A HYPHEN WITH DIAERESIS Dash=Y
U+301C WAVE DASH Dash=Y
U+3030 WAVY DASH Dash=Y
U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN Dash=Y
U+30FB KATAKANA MIDDLE DOT Hyphen=Y
U+FE31 PRESENTATION FORM FOR VERTICAL EM DASH Dash=Y
U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH Dash=Y
U+FE58 SMALL EM DASH Dash=Y
U+FE63 SMALL HYPHEN-MINUS Dash=Y Hyphen=Y
U+FF0D FULLWIDTH HYPHEN-MINUS Dash=Y Hyphen=Y
U+FF65 HALFWIDTH KATAKANA MIDDLE DOT Hyphen=Y
Another small detail is that Dash
or Hyphen
are only defined as "N"
for the group E000–F8FF (I used an U+2013 here!) in the grouped version, but are well defined for 107 356 elements in the full and flat version. There are a total of 107 364 defined characters. Properly, 8 of them are defined true for both.
Another detail, instead of a space (U+0020), a figure space (U+2007) was used, as a thousands separator. The International System of Units has declared a space to be used as a decimal separator, and while Wikipedia — Thin Space says they use a thin space to denote this, I disagree. I believe a figure space is the correct option.
A shorter, more useful list:
Unicode | Symbol | Usage |
---|---|---|
U+002D | - | It’s what you get when hitting the numpad! A compromise needed during the 7-bit era. |
U+00AD | | You’re probably not even seing this one, as it’s only used to allow renderers to break a line, with a true hyphen instead! |
U+2010 | ‐ | Hyphens. Used to split super-words (like this one!) into different parts. Also used at the end of a line to mark the splitting of a word. |
U+2011 | ‑ | Visibly, the same as above. Except the renderer mustn’t add a newline after it. |
U+2012 | ‒ | Phone numbers! Used mostly in USA, though. (Witness how we both have figure dashes and figure spaces) |
U+2013 | – | I know this one by heart. En dash. Used to mark periods. Or intervals. Anything with a range, really. I’ve been living during the years 1997–2019. In latex you can use a double hyphen-minus. |
U+2014 | — | This one is also a nice one to know! <br /> Used to add comments —much like parentheses—, to halt mid-sente—! Ah! I almost forgot: It’s also used commonly in bibliographies (or other tabular data) to repeat the previous column's data.<br /> So… many… usages. Also, look at the title of this blog entry. In latex it suffises to use three hyphen-minus symbols. |
U+2015 | ― | Used at the beginning of a new speaker in dialogues. |
U+2053 | ⁓ | Used in examples of dictionaries as a placeholder for the word being defined. I might come back to variations of tildes sometime. |
U+207B | ⁻ | Very useful in mathematics. The arcsin(x) function is often represented as sin⁻ⁱ(x) . |
U+208B | ₋ | Similarly, useful if you need to type xₖ₋₁ . |
U+2212 | − | A true minus sign. x = A−B |
Oh, and don’t forget: You can easily input Unicode characters in Ibus with <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>U</kbd>+<kbd><Unicode></kbd>+<kbd>␣</kbd>.
I have had some problems in the past with Qt5 not allowing unicode entry, but I usually fix this by installing the libraries for qt-ibus. On Manjaro the package is called ibus-qt
, and needs this in my .bashrc
file:
export GTK_IM_MODULE=ibus
export XMODIFIERS=@im=ibus
export QT_IM_MODULE=ibus
ibus-daemon -drx
Agh, it’s 4:27 a.m. Time to sleep, I guess.
¹: i am aware of the typo, but it's funny so it's staying