Details
- Type: Improvement
- Resolution: Unresolved
- Priority: High
- Improvement: Websites / Webapps are now downloaded with a Headless Browser and an optional configuration
Description
Concept
Instead of downloading a website with WGET, a URL should be downloaded with a headless browser to be able to capture dynamically loaded content. To properly capture the website, it should be captured with JavaScript injected into the target website. There should also be an optional JSON configuration that defines further processing options for the scraping.
This configuration can be supplied as a normal T5 config option (runtimeOptions.plugins.VisualReview.websiteDownloadConfiguration, type JSON), overridable on client and import level, or it can be directly added in the ZIP as "websiteDownloadConfiguration.json" alongside the "reviewHtml.txt" for the URLs.
This JSON may have the following properties to configure the download of the website. All elements are identified by CSS selectors. It also has options to add custom CSS and JS into the markup of the scraped page(s) for more elaborate customizing (e.g. turning a slider into several items shown beneath each other):
[ "delay" => 10, // the delay after the DOM loaded event before we scrape "deleteElements" => ["exampleID", ".exampleClass > .otherClass"], // list of elements to delete from the markup (ele.remove()) "showElements" => ["exampleID", ".exampleClass > .otherClass"], // list of elements to set to "display:block;" "fadeinElements" => ["exampleID", ".exampleClass > .otherClass"], // list of elements to set to "opacity:100;" "customCss" => ".someCustomElement { display:block; widh:50%px; }", // custom CSS to be added as the last style-tag into the page-head "customJs" => "alert('just an example')", // custom JS to be added in a script-tag before </body> ]
After the headless browser has navigated to the given page, we wait for Page::NETWORK_IDLE (see: https://github.com/chrome-php/chrome).
If configured, we wait the additional delay. Then the elements to hide/show are processed according to the configuration, provided they exist in the DOM. Then we take a "Screenshot" (clone) of the DOM and save it to a tmp directory. Then (still in the visualconverter container) the linked resources are downloaded and saved alongside the HTML (in a subfolder "resources"). The whole site, packed as a ZIP, is returned to the visual plugin.
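A minimal sketch of this flow with the chrome-php library (the variable names, the tmp path and the config handling are assumptions, not existing plugin code):

```php
<?php
use HeadlessChromium\BrowserFactory;
use HeadlessChromium\Page;

// assumed inputs; in the plugin these would come from the import and the T5 config
$url = 'https://www.example.com/';
$config = json_decode(file_get_contents('websiteDownloadConfiguration.json'), true) ?? [];

$browser = (new BrowserFactory())->createBrowser([
    'windowSize' => [1024, 1280], // matches a typical t5 viewport (see Technical Details)
]);

try {
    $page = $browser->createPage();
    // wait until the network is idle so dynamically loaded content is present
    $page->navigate($url)->waitForNavigation(Page::NETWORK_IDLE);

    // optional additional delay after the DOM-loaded event
    if (!empty($config['delay'])) {
        sleep((int) $config['delay']);
    }

    // process the configured elements; selectors that do not match are simply skipped
    foreach ($config['deleteElements'] ?? [] as $selector) {
        $page->evaluate(
            'document.querySelectorAll(' . json_encode($selector) . ').forEach(e => e.remove());'
        );
    }
    // showElements / fadeinElements / customCss / customJs are handled analogously

    // take the DOM "Screenshot" (clone, see below) and save it to a tmp directory
    $html = $page->evaluate('document.documentElement.outerHTML')->getReturnValue();
    file_put_contents(sys_get_temp_dir() . '/scraped-website.html', $html);
} finally {
    $browser->close();
}
```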
Technical Details
A useful browser size for the headless browser must be defined: 1024x1280 (to match a typical t5 viewport).
Before scraping, we should scroll to the end of the page (to also trigger lazy-loaded content)!
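For example (a sketch; the grace period and scrolling back up afterwards are assumptions):

```php
// scroll to the end of the page so content that is lazy-loaded on scroll gets triggered
$page->evaluate('window.scrollTo(0, document.body.scrollHeight);');
usleep(500 * 1000); // assumed grace period for lazy loading to finish
$page->evaluate('window.scrollTo(0, 0);'); // assumption: scroll back up before scraping
```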
To make a clone of the page's DOM, jQuery should be used (which has a powerful clone function); DOM's built-in cloneNode() is NOT sufficient! This is problematic, since a lot of pages may already ship with jQuery themselves. Therefore we should use jQuery's noConflict() or simply rename "jQuery" to "t5jQuery" in an adjusted lib that we then inject into the website.
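A sketch of the noConflict() variant, injected via chrome-php (the alias t5jQuery, the lib path and the serialization of the clone are assumptions):

```php
// inject our own jQuery copy, then free the global names again for the page's version
$page->evaluate(file_get_contents('jquery.min.js'));
$html = $page->evaluate(<<<'JS'
(function () {
    var t5jQuery = jQuery.noConflict(true); // restores the page's own $ and jQuery
    // jQuery's clone() copies the live DOM, i.e. including runtime-adjusted
    // attributes and styles, where a plain cloneNode() is not sufficient
    var clone = t5jQuery('html').clone();
    return '<!DOCTYPE html>' + t5jQuery('<div>').append(clone).html();
})();
JS)->getReturnValue();
```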
What must be checked/made sure
- runtime-adjusted image/media src attributes are properly cloned
- runtime-adjusted CSS properties are properly cloned
The PHP code to download the resources of a website (including fetching chained resources from stylesheets!) already exists and should be reused:
- editor_Plugins_VisualReview_Html_Review
- editor_Plugins_VisualReview_Html_Resources
- editor_Plugins_VisualReview_Html_Resource
Keep in mind there is a configuration option that defines whether JavaScript shall be scraped as well and is allowed to be executed in the Visual (see the classes above). The configured "customJs" should be injected even if JS is not allowed by that config.
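A sketch of that injection step (the $config handling is an assumption; the property names are those defined in the configuration above):

```php
// customCss becomes the last style-tag in the page head
if (!empty($config['customCss'])) {
    $page->evaluate(
        'var cssEl = document.createElement("style");'
        . 'cssEl.textContent = ' . json_encode($config['customCss']) . ';'
        . 'document.head.appendChild(cssEl);'
    );
}

// customJs is added in a script-tag before </body> and thus runs in the
// headless browser during scraping, regardless of whether JS is allowed
// to execute later in the Visual
if (!empty($config['customJs'])) {
    $page->evaluate(
        'var jsEl = document.createElement("script");'
        . 'jsEl.textContent = ' . json_encode($config['customJs']) . ';'
        . 'document.body.appendChild(jsEl);'
    );
}
```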
Attachments
Issue Links
- blocks: TRANSLATE-3785 Exchange Visual PDF in ongoing project (Selected for dev)
- is blocked by: TRANSLATE-4008 Use noreply address as default sender (Done)