[TRANSLATE-4370] OpenAI GPT Training Improvements: Data Model - translate5 JIRA issue tracker

Details

Type: Sub-task
Resolution: Unresolved
Fix Version/s: None
Affects Version/s: None
Component/s: openai

Urgency:
High
ChangeLog Description:
added database tables to store predefined prompts data
Checklist:

Empty

show more show less

Description

Enhanced OpenAI training Data-Model

Currently LEK_openai_finetunejob is the only data-table, this must be changed to seperate system-msgs from exmples from finetune-jobs. We still capture the training as a file though for reference.
My idea is to NOT to save examples nor system-messages line-by-line but to keep the examples in a "holistic" model as I think there will be no need to have the single examples as individual entities. Searching can be accomplished with MySQLs JSON functions when needed and performance is not a relevant factor here. All created & lastChange fields are PHP-timestamps

Added tables:

LEK_openai_sysmessage

Holds the system-messages, the field "lang" will always be "en" currently. The json may includes one or several sys-messages, which may consist of several sentences each

    columns: ( id | lang | json | name | comment | created | lastChange )
    json: [
        { "message": "Just an example system message" }
         ...
    ]

LEK_openai_exampleset

Holds the examples as source and target strings. 1:n connection to LEK_openai_sysmessage. Source and target language generally can be with and/or without country. The "isComplete" flag is calculated after edit and represents, if all source-texts are translated. For a training, only translated lines are used.

    columns: ( id | sysMessageId | sourceLang | targetLang | json | comment | created | lastChange | isComplete )
    json: [
        { "source": "This is example 1", "target" : "Das ist Beispiel 1" }
        { "source": "This is example 2", "target" : "Das ist Beispiel 2" }
        ...
    ]

LEK_openai_finetunejob

column "conversation" will NOT be replaced with the associated sys-messages, because 'conversation' column will contain sysmessages and examples used at the point of time where training happened so 'conversation' will contain the history, and the sysmessages and examples in LEK_openai_sysmessage and LEK_openai_exampleset tables, respectively - might evolve and improve over time. So, each time training is submitted - the data from prompts which are added to training - is converted to 'conversation' and stored there.

Attachments

Activity

People

Assignee:: Pavel Perminov

Reporter:: Axel Becher

Peer developer:: Axel Becher

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 13/Jan/2025 06:26

Updated:: 27/Feb/2025 10:05