Note: The obligatory TL;DR section is at the very end of this text.

The ASN.1 rant

After co-founding Echo, I had to put asn1c development on hold for sheer lack of time. (If you don't know what my asn1c is, think of it as the most evolved open source ASN.1 compiler.) Despite suspending development, I've been tracking the evolution of ASN.1, as well as the emergence of some newer technologies competing with what ASN.1 has to offer. I am referring to Facebook's Thrift, Google's Protocol Buffers, Cisco's Etch, and the like. Yet until now I had no opportunity to actually use any of those in production.

I have my own personal score with the ASN.1 world. The standard is laden with design-by-committee complexity, no doubt evolved to “address the real world demands”. Its mind-numbing standardese and semantics effectively prohibit any newcomer from ever entering this field and producing a decent new compiler. So, we're stuck with something like 2.5 living ASN.1 compilers covering some 3 mainstream languages (C++, C#, Java). The commercial products are often cost prohibitive: they're squarely aimed at the rich telecom market. Where would a small Ruby or Python startup go? (While I am at it: Erlang has a decent free compiler, you know.)

Yet, there's an opposite side to this complexity. Many things you struggle with or “invent” for the purpose of better data serialization have already been invented in the ASN.1 world. Things like broiled-to-perfection TLV-based encodings (BER/DER/CER), bitwise Packed Encoding Rules (competing with gzip'ing your serialized binary blob), Information Object Classes (think of SNMP MIB macros on steroids), or Encoding Control Notation have a lot to offer and learn from.

But I should stop kicking that dead horse. Let's try Thrift for a change.

Thrift

0. Selecting Thrift for OCaml

Two days ago I started using Thrift for a real, production task. My first impression is that [the marshaling part of] the Thrift framework is severely underspecified, and the target language generators are not consistent in their treatment of the “standard” between each other. Here's what I found while attempting to build Thrift into our OCaml code base. Some pieces are just my initial observations, and some of them warrant further discussion.

lyelik selected Thrift for our next project based on its language support matrix. We needed a safe serialization mechanism for OCaml, and it seemed that Thrift would offer us some help with that.

Why would we need Thrift instead of the built-in Marshal functionality? The format produced by OCaml's Marshal module lacks versioning, and its implementation is unsafe. Marshal.from_channel can crash the process when the consumer attempts to read a serialized message created by a different version of the producer.
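The lack of safety can be seen even without crashing anything. The following is a minimal sketch that uses only the standard Marshal module; the int-versus-char confusion is an illustration chosen because it happens to be harmless, not something taken from our code base. The point is that the result type of Marshal.from_string is whatever the caller's annotation claims, and nothing in the blob is checked against it.

```ocaml
(* Marshal stores no type information: the reader's annotation is
   trusted blindly. *)
let blob : string = Marshal.to_string (65 : int) []

(* Correct annotation: the value round-trips as expected. *)
let as_int : int = Marshal.from_string blob 0

(* Wrong annotation: this still "succeeds" and silently reinterprets
   the value. It is harmless here only because int and char are both
   immediate values; the same mistake on structured data is undefined
   behaviour and may crash the process. *)
let as_char : char = Marshal.from_string blob 0

let () = Printf.printf "%d %c\n" as_int as_char
```

A generated reader, by contrast, knows the schema and can reject a blob that does not match it.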

So, off we went and created a module for our OCaml serializing needs. Let's name it test.thrift for the purposes of this article (Point 1):
// test.thrift
struct MyStruct {
  1:           string myNoReq,
  2:           string myNoReqDef = "noreqdef",
  3:  optional string myOpt,
  4:  optional string myOptDef = "optdef",
  5:  required string myReq,
  6:  required string myReqDef = "reqdef",
  7:     list<string> myListDefEmpty = [],
  8:     list<string> myListDef = ["listElement"],
  9:  required double myDblDef0 = 0.0,
 10:  required double myDblDefPi = 3.1415
}
The first issue that you notice is related to “optional” vs. “required” vs. absent field requiredness.

1. Problems with requiredness

Requiredness is a term used in the Thrift IDL. The whitepaper which introduced Thrift does not talk about the requiredness flag at all. It appears that requiredness is “defined by implementation” and never formally described. Because of that, there are discrepancies in how the language generators treat requiredness flags.

1.1. Treatment of absent (default) requiredness

The first problem is the treatment of absent requiredness. It looks like a binary toggle (“optional” or “required”) with some default, and you just want to figure out what the default is. Let's run an experiment. Let's compile test.thrift into C++, Java, and PHP to check whether the code generated for the myNoReq field is similar to the code for the myOpt or myReq fields.

C++, myNoReq:
// __isset member:
bool myNoReq;
// comparison operator:
if (!(myNoReq == rhs.myNoReq))
  return false;
// write: happens always

C++, myOpt:
// __isset member:
bool myOpt;
// comparison operator:
if (__isset.myOpt != rhs.__isset.myOpt)
  return false;
else if (__isset.myOpt && !(myOpt == rhs.myOpt))
  return false;
// write:
if (this->__isset.myOpt) { … }

C++, myReq:
// Not part of __isset.
// comparison operator:
if (!(myReq == rhs.myReq))
  return false;
// write: happens always

Java, myNoReq:
// write():
if (this.myNoReq != null) {
  oprot.writeFieldBegin(MY_NO_REQ_FIELD_DESC);
  oprot.writeString(this.myNoReq);
  oprot.writeFieldEnd();
}

Java, myOpt:
// write():
if (this.myOpt != null) {
  // Yes, they do it twice:
  if (this.myOpt != null) {
    oprot.writeFieldBegin(MY_OPT_FIELD_DESC);
    oprot.writeString(this.myOpt);
    oprot.writeFieldEnd();
  }
}

Java, myReq:
// write():
if (this.myReq != null) {
  oprot.writeFieldBegin(MY_REQ_FIELD_DESC);
  oprot.writeString(this.myReq);
  oprot.writeFieldEnd();
}
// validate(): check for required fields
if (myReq == null) {
  throw new TProtocolException(
  "Required field 'myReq' was not present!"
  + " Struct: " + toString());
}

PHP, myNoReq:
public $myNoReq = null;

PHP, myOpt:
public $myOpt = null;

PHP, myReq:
public $myReq = null;


The experiment shows that the absent requiredness flag:
  • means neither “optional” nor “required” for C++,
  • does not do much for Java, except removing some code duplication compared to “optional”, and
  • does not affect the code at all for PHP.
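The C++ findings can be condensed into a small model. This is only a sketch of what the generated code above does; the type and function names are made up for illustration and are not part of Thrift.

```ocaml
(* Hypothetical model of the C++ column of the experiment. *)
type requiredness = Required | Optional | Unspecified

type cpp_traits = {
  has_isset_flag : bool;        (* field appears in the __isset struct *)
  compare_checks_isset : bool;  (* operator== looks at __isset *)
  write_checks_isset : bool;    (* write() is guarded by __isset *)
}

let cpp_behaviour (r : requiredness) : cpp_traits =
  match r with
  | Unspecified ->
    { has_isset_flag = true;
      compare_checks_isset = false;
      write_checks_isset = false }
  | Optional ->
    { has_isset_flag = true;
      compare_checks_isset = true;
      write_checks_isset = true }
  | Required ->
    { has_isset_flag = false;
      compare_checks_isset = false;
      write_checks_isset = false }

let () =
  (* The absent flag matches neither of the two named states. *)
  assert (cpp_behaviour Unspecified <> cpp_behaviour Optional);
  assert (cpp_behaviour Unspecified <> cpp_behaviour Required);
  print_endline "absent requiredness is a third state in C++"
```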
By the way, the generated PHP code is about 3.5 times smaller than the Java code. The difference in the amount of code could be explained by the simplistic parsing approach employed by PHP… if not for C++, which is also about 3.5 times smaller than Java, while having more advanced semantics than both Java and PHP. For reference: 317 lines PHP vs. 352 lines C++ vs. 1190 lines Java.

The Thrift compiler source code specifies a third kind of requiredness, namely T_OPT_IN_REQ_OUT, which is not properly documented but is used in a fancy way, at least for C++ code generation. In fact, the best description of it comes from the JIRA-455 issue, still open after a year:
Jonathan Ellis added a comment - 14/Aug/09 11:21 AM
I have no idea what this means, but having fields that are neither required nor optional and get special cased differently in different places of the code is broken.
If you really want to have 3 states then 2 of them should not be called "required" and "optional" because common sense implies that if a field is not one it must be the other.

David Reiss added a comment - 14/Aug/09 01:29 PM
That is not going to happen. In C++, optional fields require applications to manually set __isset fields, otherwise they are not serialized on writing. Required fields throw an exception if they are absent. The default has neither of these issues. It is the most sensible choice when working with C++. […]

Note that if the field is required and not set, the C++ code throws an exception, whereas PHP happily encodes no value for that field. What happens when a C++ receiver reads a blob encoded by PHP in this fashion? It will throw TProtocolException::INVALID_DATA. On the other hand, a PHP receiver will parse it just fine. Meet implementation discrepancy.
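The discrepancy can be reproduced with a toy model. Everything below is a hypothetical stand-in: the wire is just a list of (field id, value) pairs, and the two readers imitate the behaviour described above rather than the actual generated C++ and PHP code.

```ocaml
(* Toy wire format: a list of (field id, value) pairs. *)
exception Invalid_data of string

type wire = (int * string) list

(* Field 5 is "required string myReq" in test.thrift. *)
let my_req_id = 5

(* PHP-like writer: an unset required field is silently left out. *)
let lax_write (my_req : string option) : wire =
  match my_req with
  | None -> []
  | Some v -> [ (my_req_id, v) ]

(* C++-like reader: a missing required field is a protocol error. *)
let strict_read (w : wire) : string =
  match List.assoc_opt my_req_id w with
  | Some v -> v
  | None -> raise (Invalid_data "required field myReq is missing")

(* PHP-like reader: a missing field simply stays unset. *)
let lax_read (w : wire) : string option = List.assoc_opt my_req_id w

let () =
  let blob = lax_write None in
  assert (lax_read blob = None);
  match strict_read blob with
  | _ -> print_endline "strict reader accepted the blob"
  | exception Invalid_data msg ->
    print_endline ("strict reader rejected: " ^ msg)
```

The same blob is valid for one receiver and invalid for the other, which is exactly what an interoperability format is supposed to rule out.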

I believe that coming up with, maintaining, and defending this kind of poor naming and semantic discrepancy is the first step to perdition. If Thrift continues to evolve this way, it will turn into another ASN.1-like nightmare, mark my words.

1.2. Relating requiredness semantics to problem domains

In the absence of a formal definition, the application programmer can easily fall into the trap of presuming that the requiredness flag can be used to model a concept from the problem domain, describing a field that your problem domain really needs in a particular context. For example, consider the following code, where requiredness allows you to specify that you cannot possibly know the host name, but may safely assume the default port.
struct targetWebSystem {
  1: required string host_name,
  2: optional i32    host_port = 80,
}
Such a definition could be very helpful if the compiler were to generate code which checks whether host_name is present in the stream being parsed. But as we've seen, the PHP deserializer is not affected by the requiredness flag at all. Since requiredness isn't discussed in the original whitepaper, and is otherwise kept underspecified, the Thrift ecosystem is losing a valuable mechanism for defining domain-specific guarantees. Without proper documentation describing this feature, it would not occur to the person implementing a new Thrift generator for yet another language that requiredness is something that you must check, even if your target language allows a lax attitude towards underspecified values!
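Here is a sketch of what a reader honoring the problem-domain reading of targetWebSystem could look like. The record, the exception, and the name-to-value list standing in for a decoded message are all assumptions made for illustration; this is not generator output.

```ocaml
(* Hypothetical hand-written reader for targetWebSystem. Neither field
   is an option: once read_target returns, both are guaranteed set. *)
type target_web_system = { host_name : string; host_port : int }

exception Missing_required of string

(* Toy decoded message: field name -> raw value. *)
let read_target (fields : (string * string) list) : target_web_system =
  let host_name =
    match List.assoc_opt "host_name" fields with
    | Some h -> h
    | None -> raise (Missing_required "host_name") (* cannot be guessed *)
  in
  let host_port =
    match List.assoc_opt "host_port" fields with
    | Some p -> int_of_string p
    | None -> 80 (* the default from the IDL *)
  in
  { host_name; host_port }

let () =
  let t = read_target [ ("host_name", "example.org") ] in
  Printf.printf "%s:%d\n" t.host_name t.host_port
```

The check happens once, at the protocol boundary, so the rest of the program never has to ask whether host_name is there.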

I believe the requiredness flag should be useful in modeling the aspects of a problem domain, whereas right now it appears to be too protocol-centric (describing what should be on the wire), inconsistent in its treatment by the target language generators, and thus confusing. At the very least, it should be properly described in the documentation rather than in a JIRA issue.

2. There is a problem with default values

2.1. Semantics of default values is non-existent or worse

Have a look at the MyStruct definition once again (see Point 1) and note the myNoReqDef, myOptDef, and myReqDef fields. What would you think is the purpose of giving them their default values ("noreqdef", "optdef", "reqdef")?

The purpose is not described in the whitepaper! In fact, the default values are introduced only by example, and are never formally defined.

One could think that specifying a default value tells the receiver that in the absence of the field value in the received blob (if omitted during encoding), the receiver should assume the field as having the default value.

There is not much stretch in this assumption. In fact, this is how ASN.1 describes the default values: the “OPTIONAL” values turning into default when there's no corresponding data on the wire. For example, consider the following ASN.1 structure:
MyStruct ::= SEQUENCE {
  fldOpt    String OPTIONAL,         -- If nothing is on the wire, fldOpt is not defined (NULL?)
  fldOptDef String DEFAULT "optdef", -- If nothing is on the wire, fldOptDef is "optdef"
  fldReq    String                   -- This one must always be present on the wire.
}
There is a problem with translating this thinking into the Thrift world. Unlike ASN.1, in Thrift default values are defined for all three types of requiredness (“required”, “optional”, absent).

What is the purpose of specifying a default value for a “required” field?

To allow the structure to be fully valid even if deserialization didn't find the corresponding value on the wire? The C++ code is not consistent with this: if you try to deserialize a blob which does not contain the value of a required field, the decoder will throw an INVALID_DATA exception.

Perhaps, allow the structure to be serializable even if the structure owner didn't manually set these required fields ((new MyStruct)->write())? Nice, but hold that thought for a moment (Point 2).

What is the purpose of specifying a default value for an “optional” field?

To allow the structure to think that it's got a value of an optional field off the wire if there was no value for that field on the wire? The C++ code is not consistent with this: it would not treat the optional field as set if the value didn't specifically come off the wire. So, a default value set in the class constructor for an optional field is semantically very different from the default value taken off the wire. What's the point in this very fine distinction? Perhaps this distinction is used to allow for a round-trip: if the value is never decoded off the wire, it won't be placed on the wire during encoding. The new->read()->write() sequence would not put the default value of an optional field to the output stream (Point 3).

But hey, remember I told you to hold the thought (see Point 2)? The discrepancy is that for “required” fields the default value is going to appear on the wire during new->write() sequence, whereas for “optional” field it won't.
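To make the discrepancy concrete, here is a hedged model of a C++-style generated struct. The record and its field names are hypothetical; only the behaviour (defaults assigned at construction, optional fields written only when flagged as set) follows the description above.

```ocaml
(* Every field holds its default after construction, but the optional
   field also carries an __isset-like flag that starts out false. *)
type my_struct = {
  my_opt_def : string;               (* 4: optional string myOptDef = "optdef" *)
  mutable my_opt_def_isset : bool;
  my_req_def : string;               (* 6: required string myReqDef = "reqdef" *)
}

let create () : my_struct =
  { my_opt_def = "optdef"; my_opt_def_isset = false; my_req_def = "reqdef" }

(* Returns the (field id, value) pairs that would reach the wire. *)
let write (s : my_struct) : (int * string) list =
  let opt = if s.my_opt_def_isset then [ (4, s.my_opt_def) ] else [] in
  opt @ [ (6, s.my_req_def) ]

let () =
  let s = create () in
  (* new->write(): only the required default is emitted. *)
  assert (write s = [ (6, "reqdef") ]);
  (* Once the flag is raised, the very same default value is emitted too. *)
  s.my_opt_def_isset <- true;
  assert (write s = [ (4, "optdef"); (6, "reqdef") ]);
  print_endline "same default, different wire output"
```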

That's not a subtle mess. It very much looks like the semantics of requiredness is not orthogonal to the semantics of a default value specification. ASN.1 kept this mess a bit more honestly exposed, at least, by lumping these things into a three-way toggle (required, optional, and optional-with-default-value)!

2.2. Default values cannot exercise their power

The ASN.1 got the other thing right. When DEFAULT was specified (remember, DEFAULT = OPTIONAL + default value), some encodings (DER) mandated that the value of the field which is equal to the default value is never encoded on the wire. This would allow very efficient encoding of extensive structures with lots of rarely changing fields. To give you an extreme example, a structure with two hundred fields all of which keep their default values (however complex) can be encoded in about two bytes.
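The idea behind this rule fits in a few lines. The sketch below is not DER: the schema representation (a list of field ids with string defaults) and the function names are assumptions chosen to show only the elision of default-valued fields and their restoration on decode.

```ocaml
(* A schema maps field ids to their default values: 200 fields here. *)
let schema : (int * string) list =
  List.init 200 (fun i -> (i + 1, "default-" ^ string_of_int (i + 1)))

(* Encode: drop every field whose value equals its default. *)
let encode (values : (int * string) list) : (int * string) list =
  List.filter (fun (id, v) -> List.assoc id schema <> v) values

(* Decode: start from the defaults, override with what was on the wire. *)
let decode (wire : (int * string) list) : (int * string) list =
  List.map
    (fun (id, def) -> (id, Option.value (List.assoc_opt id wire) ~default:def))
    schema

let () =
  let wire = encode schema in
  Printf.printf "fields on the wire: %d\n" (List.length wire);
  (* Nothing was sent, yet the receiver reconstructs all 200 values. *)
  assert (decode wire = schema)
```

Because both sides share the schema, omitting a field loses no information; that only works if the semantics of the default is mandated rather than left to each generator.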

You might argue that in Thrift this goes the same way: an “optional” field having a default value is never encoded (see Point 3). Well, this is not true at least for Java — an “optional” field having a default value is always encoded. Even in C++, if you set the value of a field manually, the implementation will encode it ignoring the fact that it might bear the default value. So, this is not even close to the ASN.1 DEFAULT semantics for DER.

The main difference in the approach is that ASN.1 specified semantics for the syntax, whereas in Thrift the default values are introduced to the definition language with no apparent or mandated semantics. By specifying tight semantics for every language construct, ASN.1 fosters interoperability between conforming implementations. In the absence of a formal definition, the Thrift generators choose to treat default values as largely pointless comments, and certainly do not realize their full potential.

The right approach here would be to:
  • Require that “optional” fields with default values are treated as set by default, unless explicitly unset by the user. This would be consistent with current Java and PHP semantics.
  • Allow skipping encoding of an actual value of an “optional” field with a default value if the actual value is structurally the same as the default value.
In addition to that, implementations need to make sure that “optional” fields with default values are explicitly marked set and end up assigned default values in case the corresponding value isn't found during the read() operation. Right now default values may be lost by executing a corrupting read() followed by another read() — the values are not reset to their default values prior to performing the deserialization.
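The stale-value problem and the proposed fix can be sketched as follows. The record and both reader functions are hypothetical; they model a receiver object that is reused for two consecutive messages.

```ocaml
(* Receiver object with one defaulted field (id 4, default "optdef"). *)
type msg = { mutable my_opt_def : string }

let default_value = "optdef"
let create () : msg = { my_opt_def = default_value }

(* Behaviour described above: only fields present on the wire are
   assigned, so whatever an earlier read() left behind survives. *)
let read_no_reset (m : msg) (wire : (int * string) list) : unit =
  List.iter (fun (id, v) -> if id = 4 then m.my_opt_def <- v) wire

(* Proposed behaviour: restore defaults first, then apply the wire data. *)
let read_with_reset (m : msg) (wire : (int * string) list) : unit =
  m.my_opt_def <- default_value;
  read_no_reset m wire

let () =
  let a = create () and b = create () in
  read_no_reset a [ (4, "first") ];
  read_no_reset a [];               (* second message omits the field *)
  read_with_reset b [ (4, "first") ];
  read_with_reset b [];
  Printf.printf "no reset: %s, with reset: %s\n" a.my_opt_def b.my_opt_def
```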

3. The OCaml generators are lagging behind


I could say the same about other language generators, such as Haskell or PHP, but I am more interested in OCaml at the moment.

3.1. There is no account for requiredness flag or default values

Every member of the OCaml structure generated from MyStruct (see Point 1) is defined as "<type> option" (or "Maybe <type>" for you Haskellers). Here's how:
class myStruct =
object (self)
  val mutable _myNoReq : string option = None
  val mutable _myNoReqDef : string option = None
  val mutable _myOpt : string option = None
  val mutable _myOptDef : string option = None
  val mutable _myReq : string option = None
  val mutable _myReqDef : string option = None
  val mutable _myListDefEmpty : string list option = None
  val mutable _myListDef : string list option = None
  val mutable _myDblDef0 : float option = None
  val mutable _myDblDefPi : float option = None
  method grab_myNoReq = match _myNoReq with
    | None -> raise (Field_empty "myStruct.myNoReq")
    | Some _x -> _x
  method set_myNoReq _x = _myNoReq <- Some _x
  …
end
It is rather good that all the member methods are consistent in that #get_<type> returns <type> option and #grab_<type> returns <type>, generating an exception if the field is not set. However, it's rather bad that we can't specify in the Thrift file that a particular field is always set and has such and such default value. Instead, since the OCaml generator does treat default values as useless documentation, we must be very careful to #grab_<type> each time, even with the fields marked as “required”. The absence of a static guarantee that a function would not throw is unnerving.
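The difference between the two accessors shows at the call site. Below is a cut-down, hand-written imitation of the generated class (a single field, and a locally declared Field_empty exception standing in for the one the real generated code uses); it is meant only to illustrate the static-guarantee point.

```ocaml
exception Field_empty of string

class my_struct =
  object
    val mutable _myReq : string option = None
    method get_myReq = _myReq
    method grab_myReq =
      match _myReq with
      | None -> raise (Field_empty "myStruct.myReq")
      | Some x -> x
    method set_myReq x = _myReq <- Some x
  end

(* get_ forces the caller to handle None; the compiler checks this. *)
let describe (s : my_struct) : string =
  match s#get_myReq with
  | Some v -> "myReq = " ^ v
  | None -> "myReq is unset"

let () =
  let s = new my_struct in
  print_endline (describe s);
  (* grab_ type-checks just as well, but may raise at run time,
     even though the field is marked "required" in the IDL. *)
  (match s#grab_myReq with
   | v -> print_endline v
   | exception Field_empty f -> print_endline ("Field_empty: " ^ f));
  s#set_myReq "hello";
  print_endline (describe s)
```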

The “optional” and/or “required” fields with default values, on the other hand, need to be defined without that option/Maybe wrapping. Whereas I advocate changing the types of fields, I see a merit in retaining the uniform access to them through get_x/grab_x methods with conveniently uniform signatures.

But which requiredness should be transformed into a wrapper-less field, addressable without Some/None constructors and statically guaranteed to be exception-free? If we make a “required” field with a specified default value wrapper-less, then we lose the ability to detect whether the wire contained the value for such a field. Thus, we lose the protocol-centric semantics for the “required” field (that is, semantics which requires that the received wire encoding contains the value for a specific field). However, we gain the problem-domain-centric semantics, which is "make sure this field is never unset, for peace of mind".

With the problem-domain-centric semantics we don't really need to make sure that the field value is always present on the wire: if it is not present, it will be assigned the default value. See where I am going with that? There is no difference between “required”, “optional” and absent requiredness in case the field has a default value and we drop the notion of protocol-centric semantics for requiredness. This should have been thought through before being put into Thrift…

For the sake of clarity, we could have defined the field format as in the following simplified BNF:
  Field ::= RequiredField | OptionalField | DefaultValuedField
  RequiredField ::= "required" FieldType
  OptionalField ::= "optional" FieldType
  DefaultValuedField ::= FieldType "=" DefaultValue

This would ease the pain of default value and requiredness not being orthogonal, retain some of the protocol-centric semantics for requiredness, and make fields with default values do the right thing. But maybe I am missing something obvious with this scheme, so help me.
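One way to read the grammar above is as a variant type with exactly three cases, which makes the combination "required with a default" unrepresentable. The types and the on_missing function below are a hypothetical sketch of that reading, not part of any Thrift generator.

```ocaml
(* Hypothetical AST for the simplified grammar. *)
type field_type = string (* e.g. "string", "double" *)

type field =
  | Required_field of field_type
  | Optional_field of field_type
  | Default_valued_field of field_type * string (* type, default literal *)

(* What a reader does when the field is absent from the wire. *)
type on_missing =
  | Fail         (* protocol error *)
  | Leave_unset  (* field stays None *)
  | Use of string (* field takes its default *)

let on_missing (f : field) : on_missing =
  match f with
  | Required_field _ -> Fail
  | Optional_field _ -> Leave_unset
  | Default_valued_field (_, d) -> Use d

let () =
  let show = function
    | Fail -> "fail"
    | Leave_unset -> "leave unset"
    | Use d -> "use " ^ d
  in
  List.iter
    (fun f -> print_endline (show (on_missing f)))
    [ Required_field "string";
      Optional_field "string";
      Default_valued_field ("string", "optdef") ]
```

Each kind of field gets exactly one behaviour, so there is nothing left for individual generators to interpret differently.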

I think that “required” and “optional” fields with default values do not differ meaningfully, at least for OCaml code, if we don't consider backward compatibility with the existing C++ code base.

To maintain compatibility with C++, the requiredness flag can be used to distinguish between always sending the value (for “required” fields) and avoiding sending actual values equal to default (for “optional” fields).

3.2. Patch for the OCaml generator

I've prepared an OCaml generator patch against the current Thrift SVN trunk. The generated code:
  • makes unwrapped types for all fields for which default values are defined,
  • honors the protocol-centric semantics for the “required” field: such fields are always serialized,
  • throws exception instead of skipping “required” fields which do not have a set value,
  • avoids encoding values of “optional” and absent-requiredness fields if the field value is structurally equivalent to the field's default value,
  • does not break compatibility with C++ receivers,
  • is drop-in compatible with OCaml code generated by the OCaml generator in the unmodified trunk.

Let's see what the newly patched Thrift compiler generates:

Old:
class myStruct =
object (self)
  val mutable _myNoReq : string option = None
  val mutable _myNoReqDef : string option = None
  val mutable _myOpt : string option = None
  val mutable _myOptDef : string option = None
  val mutable _myReq : string option = None
  val mutable _myReqDef : string option = None
  val mutable _myListDefEmpty : string list option = None
  val mutable _myListDef : string list option = None
  val mutable _myDblDef0 : float option = None
  val mutable _myDblDefPi : float option = None

  method grab_myNoReq = match _myNoReq with
    | None -> raise (Field_empty "myStruct.myNoReq")
    | Some _x -> _x
  method set_myNoReq _x = _myNoReq <- Some _x

  method get_myOpt = _myOpt
  method grab_myOpt = match _myOpt with
    | None->raise (Field_empty "myStruct.myOpt")
    | Some _x -> _x
  method set_myOpt _x = _myOpt <- Some _x

  (* Serializing *)
  (match _myOptDef with None -> () | Some _v -> 
    oprot#writeFieldBegin("myOptDef",Protocol.T_STRING,4);
    oprot#writeString(_v);
    oprot#writeFieldEnd
  );
  (* Unsafe if myReq is not set! *)
  (match _myReq with None -> () | Some _v -> 
    oprot#writeFieldBegin("myReq",Protocol.T_STRING,5);
    oprot#writeString(_v);
    oprot#writeFieldEnd
  );
  (* Unsafe if myReqDef is not set! *)
  (match _myReqDef with None -> () | Some _v -> 
    oprot#writeFieldBegin("myReqDef",Protocol.T_STRING,6);
    oprot#writeString(_v);
    oprot#writeFieldEnd
  );

  …

Patched:
class myStruct =
object (self)
  val mutable _myNoReq : string option = None
  val mutable _myNoReqDef : string = "noreqdef"
  val mutable _myOpt : string option = None
  val mutable _myOptDef : string = "optdef"
  val mutable _myReq : string option = None
  val mutable _myReqDef : string = "reqdef"
  val mutable _myListDefEmpty : string list = []
  val mutable _myListDef : string list = ["listElement"]
  val mutable _myDblDef0 : float = 0.0
  val mutable _myDblDefPi : float = 3.1415

  method grab_myNoReq = match _myNoReq with
    | None -> raise (Field_empty "myStruct.myNoReq")
    | Some _x -> _x
  method set_myNoReq _x = _myNoReq <- Some _x

  method get_myOptDef = Some _myOptDef
  method grab_myOptDef = _myOptDef
  method set_myOptDef _x = _myOptDef <- _x

  (* Serializing *)
  (match _myOptDef with "optdef" -> () | _v -> 
    oprot#writeFieldBegin("myOptDef",Protocol.T_STRING,4);
    oprot#writeString(_v);
    oprot#writeFieldEnd
  );
  (match _myReq with 
  | None -> raise (Field_empty "myStruct._myReq")
  | Some _v -> 
    oprot#writeFieldBegin("myReq",Protocol.T_STRING,5);
    oprot#writeString(_v);
    oprot#writeFieldEnd
  );
  (
    oprot#writeFieldBegin("myReqDef",Protocol.T_STRING,6);
    oprot#writeString(_myReqDef);
    oprot#writeFieldEnd
  );

  …

Note that the new code treats “required” and “optional” fields with default values differently, accounting for the semantic differences.

You can download the patch here: http://lionet.info/patches/thrift-trunk-962854.patch

4. Obligatory TL;DR section

The Thrift specification underspecifies several important aspects of the description language semantics. The Thrift target language code generators are inconsistent in the way they treat certain parts of the specification. I made an attempt to make the OCaml generator produce somewhat safer and more compliant code, and am sharing a patch with you.
https://issues.apache.org/jira/browse/THRIFT-827
https://issues.apache.org/jira/browse/THRIFT-860

P.S. The above patches have since been merged into Thrift.

Comments

inv2004
Jul. 19th, 2010 07:44 am (UTC)
I remember that in the end I had to parse ASN.1 by hand in OCaml. Just saying.
lionet
Jul. 19th, 2010 07:46 am (UTC)
I have plans to add OCaml support to asn1c.
vorushin
Jul. 19th, 2010 05:02 pm (UTC)
"The format produced by the OCaml's Marshal module lacks versioning and is the implementation is unsafe"

and its implementation is unsafe?
lionet
Jul. 19th, 2010 11:17 pm (UTC)
10x
zerthurd
Jul. 19th, 2010 05:24 pm (UTC)
Half a year ago I started writing an ASN.1 compiler (in OCaml); so far I have only done the construction of a file's AST (I got worn out typing up the parser). The plan was to start with generating OCaml code. It parses almost all of the ASN.1 files from various standards that I have access to.
wizzard0
Jul. 19th, 2010 05:57 pm (UTC)
Informative. Good to know.
(Deleted comment)
lionet
Jul. 21st, 2010 12:23 am (UTC)
Protobufs and ASN.1 never even entered the incubator, and that doesn't hurt them.

Attaching a tag to the Marshal output means that when you release new code to production, you can't simply upgrade the system; you have to invalidate all the terabytes in memcache. Obviously, that kills production instantly, and with it the point of this kind of versioning.

Why can't you just upgrade? Because deserializing incompatible data through the Marshal module simply results in SIGSEGV, and there is nothing you can do about it.

json-static is nice, yes. But you can't work with it from other languages. What we want is to generate code for several languages from one single specification, rather than use different tools to parse the same data schema, which moreover does not exist anywhere in an explicit, single, and consistent form. Thrift solves this.
(Deleted comment)
lionet
Jul. 23rd, 2010 01:30 pm (UTC)
I don't understand at all: where exactly do "the functions of_v1 of_v2 of_v3 remain"?