<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" ><channel><title>Kevin Deldycke &#187; regexp</title> <atom:link href="http://kevin.deldycke.com/tag/regular-expression/feed/" rel="self" type="application/rss+xml" /><link>http://kevin.deldycke.com</link> <description>Free software engineer &#38; wannabe videomaker</description> <lastBuildDate>Fri, 03 Feb 2012 19:08:27 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>How I Open-Sourced an Internal Corporate Project (WebPing)</title><link>http://kevin.deldycke.com/2011/08/how-open-source-an-internal-corporate-project-webping/</link> <comments>http://kevin.deldycke.com/2011/08/how-open-source-an-internal-corporate-project-webping/#comments</comments> <pubDate>Tue, 30 Aug 2011 10:18:01 +0000</pubDate> <dc:creator>Kev</dc:creator> <category><![CDATA[English]]></category> <category><![CDATA[CLI]]></category> <category><![CDATA[Git]]></category> <category><![CDATA[GitHub]]></category> <category><![CDATA[Linux]]></category> <category><![CDATA[Perl]]></category> <category><![CDATA[Python]]></category> <category><![CDATA[regexp]]></category> <category><![CDATA[Subversion]]></category> <category><![CDATA[trac]]></category> <category><![CDATA[webping]]></category><guid isPermaLink="false">http://kevin.deldycke.com/?p=3749</guid> <description><![CDATA[2 weeks ago I released WebPing. This article is more or less the same as the one I wrote 4 months ago when I released the FTT project and needed to move it from SVN to Git. 
But this time I added more &#8230; <a href="http://kevin.deldycke.com/2011/08/how-open-source-an-internal-corporate-project-webping/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><a href="http://kevin.deldycke.com/2011/08/webping-open-sourced/">2 weeks ago I released WebPing</a>. This article is more or less the same as the one I wrote 4 months ago when I <a href="http://kevin.deldycke.com/2011/03/feed-tracking-tool-released-open-source-license/">released the FTT project</a> and needed to <a href="http://kevin.deldycke.com/2011/04/ftt-migration-subversion-git/">move it from SVN to Git</a>. But this time I added more details on how I removed all the sensitive information that was hard-coded in the project files.</p><h2>Subversion to Git migration</h2><p>Everything starts from a local copy of the Subversion repository that had hosted the WebPing project since its inception:</p><pre class="brush: bash; title: ; notranslate">
$ rm -rf ./svn-repository-copy
$ tar xvzf ./svn-repository-copy.tar.gz
$ kill `ps -ef | grep svnserve | awk '{print $2}'`
$ svnserve --daemon --listen-port 3690 --root ./svn-repository-copy
</pre><p>Let&#8217;s initialize a Git repository:</p><pre class="brush: bash; title: ; notranslate">
$ rm -rf ./webping-git
$ mkdir ./webping-git
$ cd ./webping-git
$ git init
$ git commit --allow-empty -m 'Initial commit'
$ git tag &quot;init&quot;
</pre><p>We now migrate the code from Subversion to Git:</p><pre class="brush: bash; title: ; notranslate">
$ git svn init --no-metadata --username deldycke svn://localhost:3690
$ git svn fetch
$ git rebase --onto git-svn master
$ rm -rf ./.git/svn/
$ rm -rf ./.git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --aggressive --prune=now
</pre><h2>Removing unrelated files and folders</h2><p>As WebPing was not alone in the original Subversion repository, we need to clean up the latter and only keep the code of the former. Worse, WebPing didn&#8217;t start its life in a dedicated subfolder, but as a tool of another project, and jumped from folder to folder. After identifying in the history all the places where WebPing once lived, I came up with this big, convoluted command line to do the cleaning:</p><pre class="brush: bash; title: ; notranslate">
$ git filter-branch --force --prune-empty --tree-filter 'find ./ -not -ipath &quot;*webping*&quot; -and -not -path &quot;./other-project/trunk/tools/web-ping*&quot; -and -not -path &quot;./other-project/trunk/tools&quot; -and -not -path &quot;./other-project/trunk&quot; -and -not -path &quot;./other-project&quot; -and -not -path &quot;./.git*&quot; -and -not -path &quot;./&quot; | xargs rm -rf' -- --all
</pre><p>Strangely enough, my <code>init</code> tag went astray after the command above. So I had to rebase to get it back in line:</p><pre class="brush: bash; title: ; notranslate">
$ git rebase init master
</pre><p>We can now remove SVN tags and branches, get rid of the imported <code>git-svn</code> branch, and clean up our Git repository:</p><pre class="brush: bash; title: ; notranslate">
$ git filter-branch --force --prune-empty --tree-filter 'find -path &quot;./WebPing/tags*&quot; | xargs rm -rf' -- --all
$ git filter-branch --force --prune-empty --tree-filter 'find -path &quot;./WebPing/branches*&quot; | xargs rm -rf' -- --all
$ git branch -r -D git-svn
$ rm -rf ./.git/svn/
$ rm -rf ./.git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --aggressive --prune=now
</pre><p>Although the repository now only contains WebPing code, that code still jumps throughout the history between the following locations:</p><ul><li><code>other-project/trunk/tools/web-ping.py</code></li><li><code>other-project/trunk/tools/web-ping/</code></li><li><code>WebPing/trunk/</code></li></ul><p>Using a series of <code>git filter-branch</code> invocations, I managed to move everything to the root of the repository:</p><pre class="brush: bash; title: ; notranslate">
$ git filter-branch --force --prune-empty --tree-filter 'test -d ./other-project/trunk/tools &amp;&amp; cp -axv ./other-project/trunk/tools/* ./ &amp;&amp; rm -rf ./other-project/trunk/tools || echo &quot;No tools folder found&quot;' -- --all
$ git filter-branch --force --prune-empty --tree-filter 'test -d ./other-project/trunk/tools/web-ping &amp;&amp; cp -axv ./other-project/trunk/tools/web-ping/* ./ &amp;&amp; rm -rf ./other-project/trunk/tools/web-ping || echo &quot;No web-ping folder found&quot;' -- --all
$ git filter-branch --force --prune-empty --tree-filter 'test -d ./WebPing/trunk &amp;&amp; cp -axv ./WebPing/trunk/* ./ &amp;&amp; rm -rf ./WebPing/trunk || echo &quot;No trunk folder found&quot;' -- --all
</pre><h2>Hide and obfuscate hard-coded content</h2><p>As WebPing was created for internal needs at my previous job, its original code base contains lots of references to the infrastructure it used to live in. My professional standards require me to remove all this sensitive information before making WebPing available to the public.</p><p>For example, here is the command that allowed me to remove all references to the hostnames of our intranets:</p><pre class="brush: bash; title: ; notranslate">
$ git filter-branch --force --prune-empty --tree-filter 'find . -type f -exec perl -i -pe &quot;s/([\w.-]*?)\.(company(-intranet|-extention)?)\.(fr|com|net|org)/intranet\.example\.com/g&quot; &quot;{}&quot; \;' -- --all
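</pre><p>Before rewriting a whole history with such a pattern, it can help to try it on a few sample strings first. Here is a minimal Python sketch of the same substitution. As in the command above, <code>company</code> only stands in for the real domain; the <code>anonymize</code> helper and the sample strings are made up for illustration:</p><pre class="brush: python; title: ; notranslate">
import re

# Same idea as the Perl pattern above: any host under the (placeholder)
# company domains is replaced by a neutral hostname.
HOSTNAME = re.compile(
    r'[\w.-]*?\.company(?:-intranet|-extention)?\.(?:fr|com|net|org)')


def anonymize(text):
    # Hypothetical helper: return the text with internal hostnames replaced.
    return HOSTNAME.sub('intranet.example.com', text)


# Made-up samples, only there to eyeball the result before the real run.
print(anonymize('http://wiki.dev.company-intranet.fr/status'))
print(anonymize('ping app.company.com now'))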
</pre><p>The Perl one-liner embedded in the command above only applies the regular expression line by line. If you want the regexp applied to the whole content of each file, you have to use Perl&#8217;s <em>slurp</em> mode (<a href="http://www.math.uiuc.edu/~hildebr/computer/perltips.html">source of that tip</a>):</p><pre class="brush: bash; title: ; notranslate">
$ git filter-branch --force --prune-empty --tree-filter 'perl -0777 -i -pe &quot;s/MAILING_LIST\s*=\s*\[(.*?)\]/MAILING_LIST = \[\]/gs&quot; ./web-ping.py' -- --all
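</pre><p>To see the difference between the two modes without touching a repository, here is a small Python sketch. The sample content and the two helper functions are hypothetical, and <code>re.DOTALL</code> plays the role of the <code>s</code> modifier of the Perl substitution:</p><pre class="brush: python; title: ; notranslate">
import re

# Hypothetical file content: a Python list spread over several lines.
SOURCE = '''MAILING_LIST = [
    'alice@example.com',
    'bob@example.com',
]
TIMEOUT = 30
'''

# re.DOTALL lets the dot match newlines too.
PATTERN = re.compile(r'MAILING_LIST\s*=\s*\[(.*?)\]', re.DOTALL)


def scrub_line_by_line(text):
    # Mimics perl -pe: the substitution only ever sees one line at a time,
    # so a list closed on a later line is never matched.
    lines = text.splitlines(True)
    return ''.join(PATTERN.sub('MAILING_LIST = []', line) for line in lines)


def scrub_whole_file(text):
    # Mimics perl -0777 -pe: the substitution sees the whole file at once.
    return PATTERN.sub('MAILING_LIST = []', text)


print(scrub_line_by_line(SOURCE) == SOURCE)
print(scrub_whole_file(SOURCE))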
</pre><p>The Perl command above helped me remove the content of the <code>MAILING_LIST</code> Python list found in <code>web-ping.py</code>, in order to protect from spam the email addresses of my former co-workers, which were unfortunately hard-coded in that variable.</p><p>Another place to hunt for sensitive information is commit messages. These can be easily modified thanks to the <code>--msg-filter</code> option. Here is how I removed references to our internal <a href="http://trac.edgewall.org/">Trac</a> tickets:</p><pre class="brush: bash; title: ; notranslate">
$ git filter-branch --force --msg-filter 'sed &quot;s/ (see ticket:666)//g&quot;' -- --all
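</pre><p>The <code>sed</code> expression above targets a single ticket number. A sketch of a more general filter, written in Python and assuming every reference follows the same <code>(see ticket:N)</code> format, could look like this (the <code>strip_ticket_refs</code> name and the sample message are made up):</p><pre class="brush: python; title: ; notranslate">
import re

# Assumption: ticket references always look like ' (see ticket:123)'.
TICKET_REF = re.compile(r' \(see ticket:\d+\)')


def strip_ticket_refs(message):
    # Hypothetical helper: drop every Trac ticket reference from a message.
    return TICKET_REF.sub('', message)


print(strip_ticket_refs('Fix timeout handling (see ticket:666)'))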
</pre><p>I also had to remove the carriage returns introduced by excessive use of Windows text editors (remember, WebPing was born in a corporate environment):</p><pre class="brush: bash; title: ; notranslate">
$ git filter-branch --force --prune-empty --tree-filter 'perl -i -pe &quot;s/\r//&quot; ./*' -- --all
</pre><p>The last useful command I used was the following, which fixes the authors&#8217; names and emails:</p><pre class="brush: bash; title: ; notranslate">
$ git filter-branch --force --env-filter '
    if [ &quot;$GIT_AUTHOR_NAME&quot; = &quot;deldycke&quot; ]
      then
        export GIT_AUTHOR_NAME=&quot;Kevin Deldycke&quot;
        export GIT_AUTHOR_EMAIL=&quot;kevin@deldycke.com&quot;
    fi
    if [ &quot;$GIT_AUTHOR_NAME&quot; = &quot;diehr&quot; ]
      then
        export GIT_AUTHOR_NAME=&quot;Matthieu Diehr&quot;
        export GIT_AUTHOR_EMAIL=&quot;matthieu.diehr@gmail.com&quot;
    fi
  ' -- --all
</pre><p>By using a dozen variations of the commands above, and carefully reviewing the code, I was able to engineer a clean code history.</p><p>But I was certainly a little too blunt with these regular expressions. Some of them also acted on binary content. As a result, I <a href="http://github.com/kdeldycke/webping/commit/8c72cbee1a4f72066ffe9fa82b2b06baadca9f24">had to restore static images</a> from their original copies.</p><h2>Final steps</h2><p>Now that your code is clean, all you need is to recreate your tags and fix the date of the commit behind the <code>init</code> tag before pushing everything to GitHub:</p><pre class="brush: bash; title: ; notranslate">
$ git tag -f &quot;0.0&quot; bad4ff7fc48b8b34f6f661d75c782c7fc0d098c5
$ git tag -f &quot;0.1&quot; 590ac9953df0e3bc76fd02615471e36a9796a065
$ git tag -f &quot;0.2&quot; 33f731054042b02c6d2600e7aead5bb7c4991b12
$ git filter-branch --force --env-filter '
      if [ $GIT_COMMIT = 361224542bc73bba747c7ca382e992e2cdd0c356 ]
      then
          export GIT_AUTHOR_DATE=&quot;Thu, 01 Jan 1970 00:00:00 +0000&quot;
          export GIT_COMMITTER_DATE=&quot;Thu, 01 Jan 1970 00:00:00 +0000&quot;
      fi' -- --all
$ git remote add origin git@github.com:kdeldycke/webping.git
$ git push -u origin master
$ git push --tags
</pre>]]></content:encoded> <wfw:commentRss>http://kevin.deldycke.com/2011/08/how-open-source-an-internal-corporate-project-webping/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Python ultimate regular expression to catch HTML tags</title><link>http://kevin.deldycke.com/2008/07/python-ultimate-regular-expression-to-catch-html-tags/</link> <comments>http://kevin.deldycke.com/2008/07/python-ultimate-regular-expression-to-catch-html-tags/#comments</comments> <pubDate>Mon, 07 Jul 2008 22:24:26 +0000</pubDate> <dc:creator>Kev</dc:creator> <category><![CDATA[English]]></category> <category><![CDATA[HTML]]></category> <category><![CDATA[programming]]></category> <category><![CDATA[Python]]></category> <category><![CDATA[regexp]]></category> <category><![CDATA[Snippet]]></category> <category><![CDATA[software]]></category> <category><![CDATA[Web]]></category> <category><![CDATA[xHTML]]></category><guid isPermaLink="false">http://kevin.deldycke.com/?p=232</guid> <description><![CDATA[1 year and 3 months ago I came up with a PHP regexp to parse HTML tag soup. Here is an improved version, in Python (my favorite language so far), that should be much more likely to detect strange HTML tags. It &#8230; <a href="http://kevin.deldycke.com/2008/07/python-ultimate-regular-expression-to-catch-html-tags/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p>1 year and 3 months ago I came up with a <a href="http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/">PHP regexp to parse HTML tag soup</a>. Here is an improved version, in Python (my favorite language so far), that should be much more likely to detect strange HTML tags. 
It also supports attributes without a value, so it&#8217;s closer to the <a href="http://www.w3.org/TR/REC-html40/">HTML specification</a>, but it doesn&#8217;t strictly stick to it, in order to catch <a href="http://en.wikipedia.org/wiki/Tag_soup">tag soup</a> and malformed tags.</p><pre class="brush: python; title: ; notranslate">
ultimate_regexp = r&quot;(?i)&lt;\/?\w+((\s+\w+(\s*=\s*(?:\&quot;.*?\&quot;|'.*?'|[^'\&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;&quot;
</pre><p>And here it is applied to a trivial example (in a Python shell):</p><pre class="brush: python; title: ; notranslate">
&gt;&gt;&gt; import re
&gt;&gt;&gt;
&gt;&gt;&gt; content = &quot;&quot;&quot;This is the &lt;strong&gt;content&lt;/strong&gt; in which we want to
&lt;em&gt;find&lt;/em&gt; &lt;a href=&quot;http://en.wikipedia.org/wiki/Html&quot;&gt;HTML&lt;/a&gt; tags.&quot;&quot;&quot;
&gt;&gt;&gt;
&gt;&gt;&gt; ultimate_regexp = r&quot;(?i)&lt;\/?\w+((\s+\w+(\s*=\s*(?:\&quot;.*?\&quot;|'.*?'|[^'\&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;&quot;
&gt;&gt;&gt;
&gt;&gt;&gt; for match in re.finditer(ultimate_regexp, content):
...   print(repr(match.group()))
...
'&lt;strong&gt;'
'&lt;/strong&gt;'
'&lt;em&gt;'
'&lt;/em&gt;'
'&lt;a href=&quot;http://en.wikipedia.org/wiki/Html&quot;&gt;'
'&lt;/a&gt;'
&gt;&gt;&gt;
</pre>]]></content:encoded> <wfw:commentRss>http://kevin.deldycke.com/2008/07/python-ultimate-regular-expression-to-catch-html-tags/feed/</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>Ultimate Regular Expression for HTML tag parsing with PHP</title><link>http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/</link> <comments>http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/#comments</comments> <pubDate>Fri, 23 Mar 2007 22:27:09 +0000</pubDate> <dc:creator>Kev</dc:creator> <category><![CDATA[English]]></category> <category><![CDATA[HTML]]></category> <category><![CDATA[parsing]]></category> <category><![CDATA[PCRE]]></category> <category><![CDATA[PHP]]></category> <category><![CDATA[regexp]]></category><guid isPermaLink="false">http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/</guid> <description><![CDATA[Disclaimer: this is a dirty hack! To parse HTML or XML, use a dedicated library. Tonight I found the ultimate regex to get HTML tags out of a string. It was written a year ago by Phil Haack on &#8230; <a href="http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><em><strong>Disclaimer</strong>: this is a dirty hack! To parse HTML or XML, <a href="#comment-4740">use a dedicated library</a>.</em></p><p>Tonight I found the ultimate <a href="http://en.wikipedia.org/wiki/Regular_expression">regex</a> to get HTML tags out of a string. It was <a href="http://haacked.com/archive/2005/04/22/Matching_HTML_With_Regex.aspx">written a year ago by Phil Haack on his blog</a>. 
His regex is quite bullet-proof: it&#8217;s able to parse HTML tags written on multiple lines and containing any sort of attributes (with or without a value, with single or double quotes).</p><p>Unfortunately his regular expression was designed for Microsoft .NET, so I spent some time converting it to PHP. Here is the result:</p><pre class="brush: php; title: ; notranslate">
$regex = &quot;/&lt;\/?\w+((\s+\w+(\s*=\s*(?:\&quot;.*?\&quot;|'.*?'|[^'\&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;/i&quot;;
</pre><p>And finally, my version based on the one above:</p><pre class="brush: php; title: ; notranslate">
$regex = &quot;/&lt;\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\&quot;.*?\&quot;|'.*?'|[^'\&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;/i&quot;;
</pre><p>The latter includes the following enhancement:</p><ul><li>accepts hyphens in the middle of attribute names (<a href="http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/#comment-3167">thanks Ged</a>)</li></ul> ]]></content:encoded> <wfw:commentRss>http://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/feed/</wfw:commentRss> <slash:comments>31</slash:comments> </item> </channel> </rss>
