-
-
Notifications
You must be signed in to change notification settings - Fork 66
Improve use of CharacterEncoding #1735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 6 commits
e02e3a9
586d3a4
96ea4e8
cd526f5
dc9c8ad
1a53c1a
3d4b0a5
0218bd9
1324c41
f58574c
79dcf9d
e7f88e5
0595532
1670da6
d9e7eb5
b01006c
8a33df5
2d82c6d
68cb8b9
cd3cf0d
b743191
ba5d790
17ab410
55ac83b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -419,10 +419,15 @@ def format_output(self, expr, format=None): | |
| if result is None: | ||
| return None | ||
|
|
||
| try: | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This ensures that when expressions are formatted as text, the encoding is always applied. With this change, if we specify that the encoding is "ASCII", all the tests must match with ASCII outputs. |
||
| encoding = self.definitions.get_ownvalue("System`$CharacterEncoding").value | ||
| except AttributeError: | ||
| encoding = "Unicode" | ||
|
|
||
| try: | ||
| # With the new implementation, if result is not a ``BoxExpression`` | ||
| # then we should raise a BoxError here. | ||
| boxes = result.to_text(evaluation=self) | ||
| boxes = result.to_text(evaluation=self, encoding=encoding) | ||
| except BoxError: | ||
| self.message( | ||
| "General", "notboxes", Expression(SymbolFullForm, result).evaluate(self) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,9 +23,17 @@ | |
| def eval_ToString( | ||
| expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation | ||
| ) -> String: | ||
| boxes = format_element(expr, evaluation, form, encoding=encoding) | ||
| text = boxes.to_text(evaluation=evaluation) | ||
| return String(text) | ||
| from mathics.format.render.text import EncodingNameError | ||
|
|
||
| boxes = format_element(expr, evaluation, form) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the final idea is that the strings in Instead,
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, but this doesn't align with how the experiments I showed you suggest WMA works. It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied. And if you do the conversion earlier, a double conversion spoils the result.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I did not find anywhere in those experiments that there was a string that was encoded one way, and inside
That is not at issue here. What is at issue here is taking a string that was wrongly encoded and re-encoding it. Consider this example where I set a breakpoint at the location we are discussing:
This is not relevant here. We started with a Mathics3 Expression, and inside Another viable solution might be to have
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I have been looking again this, and again, this is a central misunderstanding: as I see this, the line 28 must return a boxed expression that uses the internal representation (Unicode/UTF-8). Then, the result which takes the box expression and converts it into a Python string, in the request encoding. The advantage of this approach is that all the codepage translation machinary is completely localized in one module. The drawback is that we have to scan each character to see if we need to translate it. But this is how WMA does it, and I guess they developers had very good reasons to do in this way. |
||
| try: | ||
| return String(boxes.to_text(evaluation=evaluation, encoding=encoding)) | ||
| except EncodingNameError: | ||
| # Mimic the WMA behavior. In the future, we can implement the mechanism | ||
| # with encodings stored in .m files, and give a chance with it. | ||
| evaluation.message("Get", "noopen", String("encodings/" + encoding + "." + "m")) | ||
|
|
||
| return String(boxes.to_text(evaluation=evaluation, encoding="Unicode")) | ||
|
|
||
|
|
||
| def eval_StringContainsQ(name, string, patt, evaluation, options, matched): | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,7 +2,6 @@ | |
| """ | ||
| Mathics3 box rendering to plain text. | ||
| """ | ||
|
|
||
| from mathics.builtin.box.graphics import GraphicsBox | ||
| from mathics.builtin.box.graphics3d import Graphics3DBox | ||
| from mathics.builtin.box.layout import ( | ||
|
|
@@ -34,6 +33,75 @@ | |
| add_render_function(FormBox, convert_inner_box_field) | ||
|
|
||
|
|
||
| # Map WMA encoding names to Python encoding names | ||
| ENCODING_WMA_TO_PYTHON = { | ||
| "WindowsEastEurope": "cp1250", | ||
| "WindowsCyrillic": "cp1251", | ||
| "WindowsANSI": "cp1252", | ||
| "WindowsGreek": "cp1252", | ||
| "WindowsTurkish": "cp1254", | ||
| } | ||
|
|
||
|
|
||
| class EncodingNameError(Exception): | ||
| pass | ||
|
|
||
|
|
||
| def get_encoding_table(encoding: str): | ||
| """ | ||
| Return a dictionary with a map from | ||
| character codes in the internal (Unicode) | ||
| representation to the request encoding. | ||
| """ | ||
| if encoding == "Unicode": | ||
| return {} | ||
|
|
||
| # In the final implementation, this should load the corresponding | ||
| # json table or an encoding file as in WMA | ||
| # SystemFiles/CharacterEncodings/*.m | ||
| # If the encoding is not available, raise an EncodingError | ||
| try: | ||
| return { | ||
| "ASCII": { | ||
| "⧴": ":>", | ||
| "⇒": "=>", | ||
| "↔": "<->", | ||
| "⩵": "==", | ||
| "∧": "&&", | ||
| "⊻": r"\[Xor]", | ||
| "≠": "!=", | ||
| "≤": "<=", | ||
| "→": "->", | ||
| "⇾": "->", | ||
| "⇾": "->", | ||
| "⇴": "->", | ||
| "∫": r"\[Integral]", | ||
| "": r"\[DifferentialD]", | ||
| "𝑑": r"\[DifferentialD]", | ||
| "⧦": r"\[Equivalent]", | ||
| "×": r" x ", | ||
| }, | ||
| "UTF-8": {}, | ||
| }[encoding] | ||
| except KeyError: | ||
| raise EncodingNameError | ||
|
|
||
|
|
||
| def encode_string_value(value: str, encoding: str): | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This function is a just a proof of concept. The final version should look into the MathicsScanner tables |
||
| """Convert an Unicode string `value` to the required `encoding`""" | ||
|
|
||
| # In WMA, encodings are readed from SystemFiles/CharacterEncodings/*.m | ||
| # on the fly. We should load them from Mathics3-Scanner tables. | ||
| encoding_table = get_encoding_table(encoding) | ||
| if not encoding_table: | ||
| return value | ||
| result = "" | ||
| for ch in value: | ||
| ch = encoding_table.get(ch, ch) | ||
| result += ch | ||
| return result | ||
|
|
||
|
|
||
| def fractionbox(box: FractionBox, **options) -> str: | ||
| # Note: values set in `options` take precedence over `box_options` | ||
| child_options = {**options, **box.box_options} | ||
|
|
@@ -159,7 +227,7 @@ def string(s: String, **options) -> str: | |
| if value.startswith('"') and value.endswith('"'): # nopep8 | ||
| if not show_string_characters: | ||
| value = value[1:-1] | ||
| return value | ||
| return encode_string_value(value, options["encoding"]) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do we know that "encoding" has always been passed in the options dictionary? Should this be
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I put this way to check that indeed we passed the option. But yes, we can use get instead. |
||
|
|
||
|
|
||
| add_render_function(String, string) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -231,28 +231,28 @@ def test_close(): | |
| ( | ||
| 'stream = StringToStream["1.523E-19"]; Read[stream, Real]', | ||
| None, | ||
| "1.523×10^-19", | ||
| "1.523 x 10^-19", | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I thought you were going to rewrite so that we aren't testing impossible behavior? Such as testing at the encoding-independent level.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, but to do that, I need to do it in several steps, in order to avoid a much larger PRs. The plan is go over this in the next round.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How can we break up this PR into smaller, self-contained conceptual pieces?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure, because when I tried to do it in smaller pieces, I was not able to make you understand what I was doing. Maybe now I can split the part of MathML, then the encoding function, then docpipeline, then remove the internal encoding and the other fixes. |
||
| "", | ||
| ), | ||
| ("Close[stream];", None, "Null", ""), | ||
| ( | ||
| 'stream = StringToStream["-1.523e19"]; Read[stream, Real]', | ||
| None, | ||
| "-1.523×10^19", | ||
| "-1.523 x 10^19", | ||
| "", | ||
| ), | ||
| ("Close[stream];", None, "Null", ""), | ||
| ( | ||
| 'stream = StringToStream["3*^10"]; Read[stream, Real]', | ||
| None, | ||
| "3.×10^10", | ||
| "3. x 10^10", | ||
| "", | ||
| ), | ||
| ("Close[stream];", None, "Null", ""), | ||
| ( | ||
| 'stream = StringToStream["3.*^10"]; Read[stream, Real]', | ||
| None, | ||
| "3.×10^10", | ||
| "3. x 10^10", | ||
| "", | ||
| ), | ||
| ("Close[stream];", None, "Null", ""), | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -42,9 +42,9 @@ def test_ArcCos(): | |
| ("ArcTan[-1, 0]", None, "Pi", None), | ||
| ("ArcTan[0, 1]", None, "Pi / 2", None), | ||
| ("ArcTan[0, -1]", None, "-Pi / 2", None), | ||
| ("Cos[1.5 Pi]", None, "-1.83697×10^-16", None), | ||
| ("Cos[1.5 Pi]", None, "-1.83697 x 10^-16", None), | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How can I see this behavior using wolframscript? When I try using InputForm I see this: In[1]:= Cos[1.5 Pi] // InputForm
Out[1]//InputForm= -1.8369701987210297*^-16
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, you can't: InputForm uses this The closest form is the EngineeringForm: which uses the times operator. By setting the character encoding to ASCII, we see the use of
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok. Thanks for the clarification. So, in what environment do we see exactly "-1.83697 x 10^-16" as the output of "2.3*^43"? And how am I to understand that in this test, I am matching that kind of environment?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the test/format/ folder, there are two files, one is
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I am sorry that I wasn't clear. By environment, I mean in what Wolfram program or product do we see this kind of output appearing?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is because we have not yet implemented a 2D OutputForm. In any case, looking at this, what we should test in most of the cases, is the match regarding the internal representation, and not regarding the format, which is just one aspect to be tested.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Normally, I'd say yes, please let's match the internal representation. But... Unless we know that there is a WMA internal representation that we are trying to match, or unless right now this internal representation has a concrete impact on output we can see today that matches WMA, I'd say, let's not test things (let alone exhaustive tests) of something that could very well change (even if ever so slightly) as we fill things in, such as for 2D character layout.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, but that's the point. With the "internal" representation of the result, I mean
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Testing against an internal representation as a quicker, more understandable, and more trackable way to ensure that FullForm output is correct is not only fine, but it is preferable. (Of course, there will be some end-to-end "blackbox" FullForm tests as well.) Writing tests to track the internal representation for how that might be used in 2D character output that hasn't been fleshed out and is not implemented is, however, not a good idea. That should be delayed until we have the 2D renderer in place.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Logic note: In "Unless x then y", when x is false, y can happen. |
||
| ("N[Sin[1], 40]", None, "0.8414709848078965066525023216302989996226", None), | ||
| ("Tan[0.5 Pi]", None, "1.63312×10^16", None), | ||
| ("Tan[0.5 Pi]", None, "1.63312 x 10^16", None), | ||
| ], | ||
| ) | ||
| def test_trig(str_expr, msgs, str_expected, fail_msg): | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For
ToString, the default CharacterEncoding should be$CharacterEncoding