Migrating style and script tags from node bodies into Code per Node

For a recent project, I needed to migrate the contents of <script> and <style> tags that were embedded alongside other content in the body field of Drupal 6 nodes into the separate fields that Code per Node provides for JavaScript and CSS. (Code per Node is a handy module that lets content authors manage CSS/JS per node or block, and saves the styles and scripts to the filesystem for inclusion when the node is rendered; read more about CPN goodness here.)

The key is to collect all the styles into one string and all the scripts into another, then assign them to the node's cpn property as an array in this format:

<?php
$node->cpn = array(
  'css' => '<string of CSS without <style> tags goes here>',
  'js' => '<string of Javascript without <script> tags goes here>',
);
?>

Then you can save your node with node_save(), and the CSS/JS will be stored via Code per Node.
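
As a rough sketch of that step, assuming a Drupal 7 site with Code per Node enabled: the helper name custom_attach_cpn() is hypothetical, and the function_exists() guard is only there so the snippet also runs outside Drupal, where node_save() is not defined.

```php
<?php
/**
 * Hypothetical helper: attach CSS/JS strings to a node and save it.
 */
function custom_attach_cpn($node, $css, $js) {
  // Code per Node looks for this array on the node when it is saved.
  $node->cpn = array(
    'css' => $css,
    'js' => $js,
  );
  // node_save() only exists inside a bootstrapped Drupal site.
  if (function_exists('node_save')) {
    node_save($node);
  }
  return $node;
}

// Outside Drupal, a plain object stands in for a loaded node.
$node = custom_attach_cpn(new stdClass(), 'h1 { margin: 0; }', 'alert("hi");');
```

Inside Drupal you would typically pass in a node returned by node_load() rather than an empty object.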

For a migration using the Migrate module, the easiest way to do this (in my opinion) is to implement the prepare() method, and put the JS/CSS into your node's cpn variable through a helper function.

First, implement the prepare() method in your migration class:

<?php
  /**
   * Make changes to the entity immediately before it is saved.
   */
  public function prepare($entity, $row) {
    // Process the body and move <script> and <style> tags to Code per Node.
    if (isset($entity->body[$entity->language][0])) {
      $processed_info = custom_process_body_for_cpn($entity->body[$entity->language][0]['value']);
      $entity->body[$entity->language][0]['value'] = $processed_info['body'];
      $entity->cpn = $processed_info['cpn'];
    }
  }
?>

Then, add a helper function like the following in your migrate module's .module file (assuming your migrate module is named 'custom'):

<?php
/**
 * Break out style and script tags in body content into a Code per Node array.
 *
 * This function uses regular expressions to grab the content inside <script>
 * and <style> tags inside the given body HTML, then put them into separate keys
 * in an array that can be set as $node->cpn for a node before saving, which
 * will store the scripts and styles in the appropriate fields for the Code per
 * Node module.
 *
 * Why regex? I originally tried using PHP's DOMDocument to process the HTML,
 * but besides being overly verbose with error messages on all but the most
 * pristine markup, DOMDocument processed tags poorly; if there were multiple
 * script tags, or cases where script tags were inside certain other tags, only
 * one or two of the matches would work. Yuck.
 *
 * Regex is evil, but in this case necessary.
 *
 * @param string $body
 *   HTML string that could potentially contain script and style tags.
 *
 * @return array
 *   Array with the following elements:
 *     cpn: array with 'js' and 'css' keys containing corresponding strings.
 *     body: same as the body passed in, but without any script or style tags.
 */
function custom_process_body_for_cpn($body) {
  $cpn = array('js' => '', 'css' => '');

  // Search for script and style tags.
  $tags = array(
    'script' => 'js',
    'style' => 'css',
  );
  foreach ($tags as $tag => $type) {
    // Use a regular expression to match the tag and grab the text inside.
    preg_match_all("/<$tag.*?>(.*?)<\/$tag>/is", $body, $matches, PREG_SET_ORDER);
    if (!empty($matches)) {
      foreach ($matches as $match_set) {
        // The first item is the whole tag; the second is the text inside it.
        $match = trim($match_set[1]);
        // Some tags, like script tags for embedded videos, are empty, and
        // shouldn't be removed, so check to make sure there's a value.
        if ($match !== '') {
          // Remove only this tag from the body, so empty tags are left alone.
          $body = str_replace($match_set[0], '', $body);
          // Add the tag contents to the cpn array.
          $cpn[$type] .= $match . "\r\n\r\n";
        }
      }
    }
  }

  // Return the updated body and CPN array.
  return array(
    'cpn' => $cpn,
    'body' => $body,
  );
}
?>
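
For comparison, here is a minimal sketch of the same idea done in a single pass with preg_replace_callback(), which makes it easy to leave empty tags (such as a <script src="..."> embed) in the body. The function name custom_extract_tags_single_pass() is hypothetical, the pattern is slightly stricter than the one above (\b and [^>]* are my additions), and the closure syntax assumes PHP 5.3 or later.

```php
<?php
/**
 * Hypothetical single-pass variant of the helper (illustration only).
 */
function custom_extract_tags_single_pass($body) {
  $cpn = array('js' => '', 'css' => '');
  $tags = array('script' => 'js', 'style' => 'css');

  foreach ($tags as $tag => $type) {
    // \b stops "style" from matching a longer tag name; [^>]* skips attributes.
    $pattern = '/<' . $tag . '\b[^>]*>(.*?)<\/' . $tag . '>/is';
    $body = preg_replace_callback($pattern, function ($m) use (&$cpn, $type) {
      $contents = trim($m[1]);
      if ($contents === '') {
        // Empty tag: return it unchanged so it stays in the body.
        return $m[0];
      }
      // Collect the contents and drop the whole tag from the body.
      $cpn[$type] .= $contents . "\r\n\r\n";
      return '';
    }, $body);
  }

  return array('cpn' => $cpn, 'body' => $body);
}

// Usage: one inline style, one empty embed script, one inline script.
$html = '<p>Hi</p><style>p { color: red; }</style>'
  . '<script src="http://example.com/embed.js"></script>'
  . '<script type="text/javascript">alert("hi");</script>';
$result = custom_extract_tags_single_pass($html);
```

After this runs, $result['body'] still contains the embed script, while the inline CSS and JavaScript end up in $result['cpn'].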

If you were using another solution like the CSS module in Drupal 6, and need to migrate to Code per Node, your processing will be a little different, and you might need to do some work in your migration class's prepareRow() method instead. The main thing is to get the CSS/JavaScript into the $node->cpn array, then save the node. The Code per Node module will do the rest.
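
A loose sketch of that prepareRow() approach follows. It does not reflect how the CSS module actually stores its data: the legacy_css property is a hypothetical source column, and the class is a standalone stand-in so the snippet runs by itself. In a real migration, both methods would live on your Migration subclass, and prepareRow() would normally call parent::prepareRow($row) first.

```php
<?php
/**
 * Standalone stand-in for a migration class (illustration only).
 */
class CustomLegacyCssExample {

  /**
   * Clean up the source row before it is mapped.
   */
  public function prepareRow($row) {
    // 'legacy_css' is a hypothetical column holding the old per-node CSS.
    $row->legacy_css = isset($row->legacy_css) ? trim($row->legacy_css) : '';
    // Returning FALSE here would tell Migrate to skip the row.
    return TRUE;
  }

  /**
   * Copy the cleaned-up CSS onto the entity just before it is saved.
   */
  public function prepare($entity, $row) {
    if ($row->legacy_css !== '') {
      $entity->cpn = array('css' => $row->legacy_css, 'js' => '');
    }
  }

}

// Simulate what Migrate would do for a single row.
$row = new stdClass();
$row->legacy_css = "  h1 { margin: 0; }\n";
$entity = new stdClass();
$example = new CustomLegacyCssExample();
if ($example->prepareRow($row) !== FALSE) {
  $example->prepare($entity, $row);
}
```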

Comments

Thanks Jeff.

You taught me something with the reference to the Migrate module's prepare() method.
This approach could be useful for many other migration and cleanup tasks.

All the best,
Guy

I came across this when I released Inline CSS Checker and got pointed by @DamienMcKenna to CPN, which has a link to this post.

Besides using regexes, another option in some cases might be to use DOMDocument and DOMXPath as I did. I'm not 100% sure if DOMDocument::saveHTML() renders the HTML the same way it read it, though, but it seems to do a decent job.

See my note in the function comments on why I used regex; unfortunately, DOMDocument seems to not enjoy parsing any less-than-stellar HTML (for example, if someone puts a <script> tag inside another <script> tag, DOMDocument will stop finding anything beyond the first bit of the first tag contents), at least in my experience.

For many (if not most) situations, DOMDocument is far superior to regex, but sometimes, you have to get your hands a bit dirty and do something the wrong way, so it works :)