Details
- Type: Improvement
- Resolution: Unresolved
- Priority: High
- Improvement: Websites / Webapps are now downloaded with a Headless Browser and an optional configuration
Description
Concept
Instead of downloading a website with WGET, a URL should be downloaded with a headless browser to be able to capture dynamically loaded content. To properly capture the website, it should be captured with JavaScript injected into the target website. There should also be an optional JSON configuration that defines further processing options for the scraping.
This configuration can be supplied as a normal T5 config option (runtimeOptions.plugins.VisualReview.websiteDownloadConfiguration, type JSON), overridable on client and import level, or it can be directly added in the ZIP as "websiteDownloadConfiguration.json" alongside the "reviewHtml.txt" for the URLs.
This JSON may have the following properties to configure the download of the website. All elements are identified by CSS selectors. It also has options to add custom CSS and JS into the markup of the scraped page(s) for more elaborate customizing (e.g. turning a slider into several items shown beneath each other):
[ "delay" => 10, // the delay after the DOM loaded event before we scrape "deleteElements" => ["exampleID", ".exampleClass > .otherClass"], // list of elements to delete from the markup (ele.remove()) "showElements" => ["exampleID", ".exampleClass > .otherClass"], // list of elements to set to "display:block;" "fadeinElements" => ["exampleID", ".exampleClass > .otherClass"], // list of elements to set to "opacity:100;" "customCss" => ".someCustomElement { display:block; widh:50%px; }", // custom CSS to be added as the last style-tag into the page-head "customJs" => "alert('just an example')", // custom JS to be added in a script-tag before </body> ]
After the headless browser has navigated to the given page, we wait for Page::NETWORK_IDLE (see: https://github.com/chrome-php/chrome).
If configured, we wait the additional delay. Then the elements to hide/show are processed according to the configuration, provided they exist in the DOM. Then we take a "Screenshot" (clone) of the DOM and save it to a tmp directory. Then (still in the visualconverter container) the linked resources are downloaded and saved alongside the HTML (in a subfolder "resources"). The whole site, packed as a ZIP, is returned to the visual plugin.
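A minimal sketch of this flow with the chrome-php library (the variable names, the tmp path and the config handling are assumptions, not existing plugin code):

```php
<?php
use HeadlessChromium\BrowserFactory;
use HeadlessChromium\Page;

// assumed inputs; in the plugin these would come from the import and the T5 config
$url = 'https://www.example.com/';
$config = json_decode(file_get_contents('websiteDownloadConfiguration.json'), true) ?? [];

$browser = (new BrowserFactory())->createBrowser([
    'windowSize' => [1024, 1280], // matches a typical t5 viewport (see Technical Details)
]);

try {
    $page = $browser->createPage();
    // wait until the network is idle so dynamically loaded content is present
    $page->navigate($url)->waitForNavigation(Page::NETWORK_IDLE);

    // optional additional delay after the DOM-loaded event
    if (!empty($config['delay'])) {
        sleep((int) $config['delay']);
    }

    // process the configured elements; selectors that do not match are simply skipped
    foreach ($config['deleteElements'] ?? [] as $selector) {
        $page->evaluate(
            'document.querySelectorAll(' . json_encode($selector) . ').forEach(e => e.remove());'
        );
    }
    // showElements / fadeinElements / customCss / customJs are handled analogously

    // take the DOM "Screenshot" (clone, see below) and save it to a tmp directory
    $html = $page->evaluate('document.documentElement.outerHTML')->getReturnValue();
    file_put_contents(sys_get_temp_dir() . '/scraped-website.html', $html);
} finally {
    $browser->close();
}
```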
Technical Details
A useful browser size for the headless browser must be defined: 1024x1280 (to match a typical t5 viewport).
Before scraping, we should scroll to the end of the page (to also trigger lazy-loaded content)!
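For example (a sketch; the grace period and scrolling back up afterwards are assumptions):

```php
// scroll to the end of the page so content that is lazy-loaded on scroll gets triggered
$page->evaluate('window.scrollTo(0, document.body.scrollHeight);');
usleep(500 * 1000); // assumed grace period for lazy loading to finish
$page->evaluate('window.scrollTo(0, 0);'); // assumption: scroll back up before scraping
```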
To make a clone of the page's DOM, jQuery should be used (which has a powerful clone function); DOM's built-in cloneNode() is NOT sufficient! This is problematic, since a lot of pages may already ship with jQuery themselves. Therefore we should use jQuery's noConflict() or simply rename "jQuery" to "t5jQuery" in an adjusted lib that we then inject into the website.
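A sketch of the noConflict() variant, injected via chrome-php (the alias t5jQuery, the lib path and the serialization of the clone are assumptions):

```php
// inject our own jQuery copy, then free the global names again for the page's version
$page->evaluate(file_get_contents('jquery.min.js'));
$html = $page->evaluate(<<<'JS'
(function () {
    var t5jQuery = jQuery.noConflict(true); // restores the page's own $ and jQuery
    // jQuery's clone() copies the live DOM, i.e. including runtime-adjusted
    // attributes and styles, where a plain cloneNode() is not sufficient
    var clone = t5jQuery('html').clone();
    return '<!DOCTYPE html>' + t5jQuery('<div>').append(clone).html();
})();
JS)->getReturnValue();
```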
What must be checked/made sure
- runtime-adjusted image/media src attributes are properly cloned
- runtime-adjusted CSS properties are properly cloned
The PHP code to download the resources of a website (including fetching chained resources from stylesheets!) already exists and should be reused:
- editor_Plugins_VisualReview_Html_Review
- editor_Plugins_VisualReview_Html_Resources
- editor_Plugins_VisualReview_Html_Resource
Keep in mind there is a configuration option that defines whether JavaScript shall be scraped as well and is allowed to be executed in the Visual (see the classes above). The configured "customJs" should be injected even if JS is not allowed by that config.
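A sketch of that injection step (the $config handling is an assumption; the property names are those defined in the configuration above):

```php
// customCss becomes the last style-tag in the page head
if (!empty($config['customCss'])) {
    $page->evaluate(
        'var cssEl = document.createElement("style");'
        . 'cssEl.textContent = ' . json_encode($config['customCss']) . ';'
        . 'document.head.appendChild(cssEl);'
    );
}

// customJs is added in a script-tag before </body> and thus runs in the
// headless browser during scraping, regardless of whether JS is allowed
// to execute later in the Visual
if (!empty($config['customJs'])) {
    $page->evaluate(
        'var jsEl = document.createElement("script");'
        . 'jsEl.textContent = ' . json_encode($config['customJs']) . ';'
        . 'document.body.appendChild(jsEl);'
    );
}
```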
Attachments
Issue Links
- blocks: TRANSLATE-3785 Exchange Visual PDF in ongoing project (Selected for dev)
- is blocked by: TRANSLATE-4008 Use noreply address as default sender (Done)