I've figured out enough of the azw3r format to extract personal highlights, notes, and maybe bookmarks. (All strictly by inspection.) I've also written a C program to extract highlights and notes (in a text format possibly most suitable as an intermediate stage) and a perl script that uses the extracted highlights and notes to mark up the rawml for the book.
As I write this up, I see that the structures are saved AVL interval trees, which means nothing to me, and the results of a web search don't look interesting. This particular file is a strange mix of binary and text. (Of course the notes are in text, but see the following.)
Each highlight begins (for my purposes) with the string "annotation.personal.highlight" followed by 4 bytes. The first byte is always 0x03 (^C), and the next 3 bytes seem to give the length of the text string that follows, which holds the rawml byte offset of the beginning of the highlight. The same 4-byte header and text string are then repeated for the byte offset of the end of the highlight, followed by a couple dozen bytes of (as far as I am concerned) junk.
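For example, the 3 length bytes 0, 0, 7 decode as ((0*256) + 0)*256 + 7 = 7, so the offset string that follows is 7 characters long.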
Code:
annotation.personal.highlight^C^@^@^G1191325^C^@^@^G1191337^B^@^@^A...
                             3 0 0 7        3 0 0 7
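To make the layout concrete, here is a minimal sketch in C of reading one highlight record as described above. It is not the code from the azw3r repository. The function names (len3, read_offset, parse_highlight) are made up for illustration, and the 9-digit cap on an offset string is my own assumption to keep the arithmetic within an unsigned long. The format itself is only known by inspection, so treat this as a guess that fits the sample record.

```c
#include <stddef.h>
#include <string.h>

/* Marker string that precedes each highlight record. */
static const char HL_MARKER[] = "annotation.personal.highlight";

/* Decode three bytes as a length: ((b0*256) + b1)*256 + b2. */
static unsigned long len3(const unsigned char *p)
{
    return (((unsigned long)p[0] * 256) + p[1]) * 256 + p[2];
}

/*
 * Read one offset field: a 0x03 byte, three length bytes, then that many
 * ASCII digits. Returns the number of bytes consumed, or 0 if the data
 * does not fit the pattern. The 9-digit limit is an assumption.
 */
static size_t read_offset(const unsigned char *p, size_t avail,
                          unsigned long *out)
{
    unsigned long n, i, value = 0;

    if (avail < 4 || p[0] != 0x03)
        return 0;
    n = len3(p + 1);
    if (n == 0 || n > 9 || n > avail - 4)
        return 0;
    for (i = 0; i < n; i++) {
        unsigned char c = p[4 + i];
        if (c < '0' || c > '9')
            return 0;
        value = value * 10 + (unsigned long)(c - '0');
    }
    *out = value;
    return 4 + (size_t)n;
}

/*
 * Try to parse a highlight record starting at buf[pos]. On success, store
 * the rawml start and end byte offsets and return 1; otherwise return 0.
 * The junk bytes after the second offset are ignored.
 */
int parse_highlight(const unsigned char *buf, size_t len, size_t pos,
                    unsigned long *start, unsigned long *end)
{
    size_t mlen = sizeof HL_MARKER - 1;
    size_t used;

    if (pos > len || len - pos < mlen ||
        memcmp(buf + pos, HL_MARKER, mlen) != 0)
        return 0;
    pos += mlen;
    used = read_offset(buf + pos, len - pos, start);
    if (used == 0)
        return 0;
    pos += used;
    return read_offset(buf + pos, len - pos, end) != 0;
}
```

To scan a whole file, one would load it into a buffer and call parse_highlight at each position in turn. Functions like strstr are not suitable for finding the marker, since the file contains embedded zero bytes.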
Personal notes are similar to highlights. They begin with the string "annotation.personal.note", followed by the rawml byte offset of the highlight associated with the note. This is followed by more "junk", then the length of the note (in binary only), then the text of the note itself.
Bookmarks look similar to highlights, but I have not investigated.
The C code and perl script are on GitHub at https://github.com/jps-e/azw3r and attached here, along with a sed script to make the rawml viewable in a web browser.