Project author: jasoncomes

Project description: A way to download a site using WGET.
Project repo: git://github.com/jasoncomes/Site-Importer.git
Created: 2019-04-13T20:44:34Z
Project page: https://github.com/jasoncomes/Site-Importer

Static Site Export/Import to Markdown Files

1. Search & Replace


Copy this file into your project directory, save it as export.md, and run Search & Replace on the placeholders below. Be sure to exclude the trailing slash ('/') from the URLs when replacing. Remove this Search & Replace section when finished.

  1. [DOMAIN] = e.g. BusinessAnalytics.com
  2. [URL] = e.g. www.businessanalytics.com
  3. [PROTOCOL] = e.g. https://
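If you prefer to script the three replacements, a minimal sed sketch using the example values above (the temp file here only stands in for your copied export.md):

```shell
# Script the three placeholder replacements with sed.
# (The temp file stands in for your copied export.md.)
workdir=$(mktemp -d)
printf '[PROTOCOL][URL]/about/ on [DOMAIN]\n' > "$workdir/export.md"
sed -i.bak \
  -e 's/\[DOMAIN\]/BusinessAnalytics.com/g' \
  -e 's/\[URL\]/www.businessanalytics.com/g' \
  -e 's|\[PROTOCOL\]|https://|g' \
  "$workdir/export.md"
cat "$workdir/export.md"
```

The `-i.bak` form works on both GNU and BSD sed; delete the `.bak` file afterwards.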

2. Import Site


Commands:

  1. wget -m -p -E -nH -H -e robots=off --content-on-error -P '_import' -x "[URL]/sitemap.xml" -x "[URL]/404/" -D[DOMAIN] -X 'comments, /feed, **/feed, **/**/feed, **/**/**/feed, **/**/**/**/feed, wp-json, cdn-cgi' --reject-regex "(\{|(.html|.php|.htm|\/)\?).*?" [URL];
  2. cd _import;

Combined Commands:

  1. wget -m -p -E -nH -H -e robots=off --content-on-error -P '_import' -x "[URL]/sitemap.xml" -x "[URL]/404/" -D[DOMAIN] -X 'comments, /feed, **/feed, **/**/feed, **/**/**/feed, **/**/**/**/feed, wp-json, cdn-cgi' --reject-regex "(\{|(.html|.php|.htm|\/)\?).*?" [URL]; cd _import;
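The --reject-regex mainly drops query-string URLs (comment-reply links, filters, and the like). A quick local check of which URLs would survive, with the pattern adapted to POSIX ERE for grep (the example.com URLs are hypothetical):

```shell
# Which URLs would the reject pattern drop? Query-string URLs are filtered out.
kept=$(printf '%s\n' \
  'https://www.example.com/blog/' \
  'https://www.example.com/blog/?replytocom=12' \
  'https://www.example.com/page.php?id=3' \
  | grep -Ev '(\{|(\.html|\.php|\.htm|/)\?)')
echo "$kept"
```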

Scan Sitemap.xml (Optional)


This will scan the sitemap.xml and compare it to the existing import; if there are additional pages/resources, this will capture them. Some pages may not be imported on the first pass because no links point to them. If the sitemap is a parent of child sitemap.xml files, you'll need to run this command twice. If the sitemap is named differently and wasn't downloaded by the command above, re-download it into the _import directory first.

Commands:

  1. rm -rf links.txt;
  2. for file in **/*.xml; do
  3. perl -lne "print for /<loc>(.*?)<\/loc>/g" < $file >> links.txt;
  4. done;
  5. for dir in $(find * -maxdepth 10 -type d); do
  6. dir_escaped=$(echo $dir | sed -e 's/[\/&]/\\&/g');
  7. perl -pi -w -e "s/^http(.*?)$dir_escaped\/?\n$//g" links.txt;
  8. done;
  9. wget -m -c -p -E -nH -e robots=off --content-on-error -X 'comments, /feed, **/feed, **/**/feed, **/**/**/feed, **/**/**/**/feed, wp-json, cdn-cgi' --reject-regex "(\{|(.html|.php|.htm|\/)\?).*?" -i links.txt;

Combined Commands:

  1. rm -rf links.txt; for file in **/*.xml; do perl -lne "print for /<loc>(.*?)<\/loc>/g" < $file >> links.txt; done; for dir in $(find * -maxdepth 10 -type d); do dir_escaped=$(echo $dir | sed -e 's/[\/&]/\\&/g'); perl -pi -w -e "s/^http(.*?)$dir_escaped\/?\n$//gi" links.txt; done; wget -m -c -p -E -nH -e robots=off --content-on-error -X 'comments, /feed, **/feed, **/**/feed, **/**/**/feed, **/**/**/**/feed, wp-json, cdn-cgi' --reject-regex "(\{|(.html|.php|.htm|\/)\?).*?" -i links.txt;

3. Directory Structure


This transforms the directory structure into the preferred file index for our Jekyll setup. For example, a site exported from WordPress is imported as nested index.html files:

  /folder
    index.html
    file.html
    /folder
      index.html

… and is then converted so each directory's index.html becomes a sibling file named after the directory:

  folder.html
  /folder
    file.html
    folder.html

Commands:

  1. for file in **/*.htm; do
  2. new_file=$(echo $file | sed 's/\.htm/\.html/g');
  3. mv $file $new_file;
  4. done;
  5. for dir in $(find * -maxdepth 10 -type d); do
  6. if [ -f $dir/index.html ]; then
  7. mv $dir/index.html $dir.html;
  8. fi;
  9. done;
  10. find . -type d -empty -delete;
  11. for file in **/*.html; do
  12. new_file=$(echo $file | sed 's/@.*//g' | sed 's/\.php//g');
  13. mv $file $new_file;
  14. done;

Combined Commands:

  1. for file in **/*.htm; do new_file=$(echo $file | sed 's/\.htm/\.html/g'); mv $file $new_file; done; for dir in $(find * -maxdepth 10 -type d); do if [ -f $dir/index.html ]; then mv $dir/index.html $dir.html; fi; done; find . -type d -empty -delete; for file in **/*.html; do new_file=$(echo $file | sed 's/@.*//g' | sed 's/\.php//g'); mv $file $new_file; done;
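A throwaway run of the index.html flattening on a hypothetical two-level tree shows the effect:

```shell
# Flatten each directory's index.html into <dir>.html, then prune empty dirs.
workdir=$(mktemp -d); cd "$workdir"
mkdir -p blog/post-one
printf 'home\n' > index.html
printf 'post\n' > blog/post-one/index.html
for dir in $(find * -maxdepth 10 -type d); do
  if [ -f "$dir/index.html" ]; then
    mv "$dir/index.html" "$dir.html"
  fi
done
find . -type d -empty -delete
result=$(find . -type f | sort)
echo "$result"
```

blog/post-one/index.html becomes blog/post-one.html, the now-empty directory is removed, and the top-level index.html is left alone.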

4. Front Matter


The commands below cherry-pick important values from each page, including the <title>, <meta> tags, and even the URL structure, and write them into front matter at the top of the file.

Commands:

  1. for file in **/*.html; do
  2. title=$(perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' $file);
  3. featured_image=$(perl -l -0777 -ne 'print $1 if /<meta property="og:image" content="(.*?)"/si' $file);
  4. description=$(perl -l -0777 -ne 'print $1 if /<meta property="og:description" content="(.*?)"/si' $file);
  5. robots=$(perl -l -0777 -ne 'print $1 if /<meta name="robots" content="(.*?)"/si' $file);
  6. if [ -z "${description}" ]; then
  7. description=$(perl -l -0777 -ne 'print $1 if /<meta name="description" content="(.*?)"/si' $file);
  8. fi;
  9. permalink=$(echo $file | sed 's/\.html//g' | sed 's/\.php//g' | sed 's/@.*//g');
  10. if [ "$permalink" != "index" ]; then
  11. permalink_escaped=$(echo $permalink | sed -e 's/[\/&]/\\&/g');
  12. permalink="/$permalink/";
  13. else
  14. permalink=$(echo $permalink | sed 's/index/:path/g');
  15. fi;
  16. title=$(echo $title | sed -e "s/'/\\\\'/g");
  17. description=$(echo $description | sed -e "s/'/\\\\'/g");
  18. perl -lpe "BEGIN{
  19. print '---';
  20. print 'title: \"$title\"';
  21. print 'permalink: $permalink';
  22. print 'description: \"$description\"';
  23. print 'robots: \"$robots\"';
  24. print 'featured_image: \"$featured_image\"';
  25. print '---';
  26. print '';
  27. print '';
  28. }" "$file" > foo && mv foo "$file";
  29. done;

Combined Commands:

  1. for file in **/*.html; do title=$(perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' $file); featured_image=$(perl -l -0777 -ne 'print $1 if /<meta property="og:image" content="(.*?)"/si' $file); description=$(perl -l -0777 -ne 'print $1 if /<meta property="og:description" content="(.*?)"/si' $file); robots=$(perl -l -0777 -ne 'print $1 if /<meta name="robots" content="(.*?)"/si' $file); if [ -z "${description}" ]; then description=$(perl -l -0777 -ne 'print $1 if /<meta name="description" content="(.*?)"/si' $file); fi; permalink=$(echo $file | sed 's/\.html//g' | sed 's/\.php//g' | sed 's/@.*//g'); if [ "$permalink" != "index" ]; then permalink_escaped=$(echo $permalink | sed -e 's/[\/&]/\\&/g'); permalink="/$permalink/"; else permalink=$(echo $permalink | sed 's/index/:path/g'); fi; title=$(echo $title | sed -e "s/'/\\\\'/g"); description=$(echo $description | sed -e "s/'/\\\\'/g"); perl -lpe "BEGIN{ print '---'; print 'title: \"$title\"'; print 'permalink: $permalink'; print 'description: \"$description\"'; print 'robots: \"$robots\"'; print 'featured_image: \"$featured_image\"'; print '---'; print ''; print ''; }" "$file" > foo && mv foo "$file"; done;

Note: If you see front matter in only the child directories, enable globstar (shopt -s globstar) so the ** globs recurse properly.
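The extraction logic can be tried on a single hypothetical page before looping over the whole import:

```shell
# Pull title/description out of one sample page, as the loop above does per file.
workdir=$(mktemp -d); cd "$workdir"
cat > about.html <<'HTML'
<html><head>
<title> About Us </title>
<meta name="description" content="Who we are">
</head><body></body></html>
HTML
title=$(perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' about.html)
description=$(perl -l -0777 -ne 'print $1 if /<meta name="description" content="(.*?)"/si' about.html)
permalink=$(echo about.html | sed 's/\.html//g')
printf 'title: "%s"\npermalink: /%s/\ndescription: "%s"\n' "$title" "$permalink" "$description"
```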

5. SEO Setup/Cleanup


This will remove any SEO meta tags (generated by WordPress's Yoast SEO plugin) and replace them with {% include head-seo.html %}; head-seo.html then becomes the single source for your meta tags.

Commands:

  1. perl -i -p0e 's/<head(.*?)>/<head$1>\n\n{% include head-seo\.html %}\n/si;' **/*.html;
  2. perl -i -p0e 's/<title>.*?<\/title>//si;' **/*.html;
  3. perl -i -p0e 's/<!-- This site is optimized.*?SEO plugin\. -->\s*?\n//si;' **/*.html;
  4. perl -pi -w -e "s/<meta (name|itemprop|property)=(\"|\')(robots|googlebot|keywords|copyright|title|referrer|author|google-site-verification|msvalidate.*|twitter:.*|description|name|image|og:.*|article:.*|fb:.*)(\"|\').*?>\s*?\n//gi;" **/*.html;
  5. perl -pi -w -e "s/<link rel=(\"|\')(canonical|shortlink)(\"|\').*?>\s*?\n//i;" **/*.html;
  6. perl -pi -w -e "s/<script type=(\"|\')application\/ld\+json(\"|\')>.*?<\/script>\s*?\n//i;" **/*.html;

Combined Commands:

  1. perl -i -p0e 's/<head(.*?)>/<head$1>\n\n{% include head-seo\.html %}\n/si;' **/*.html; perl -i -p0e 's/<title>.*?<\/title>//si;' **/*.html; perl -i -p0e 's/<!-- This site is optimized.*?SEO plugin\. -->\s*?\n//si;' **/*.html; perl -pi -w -e "s/<meta (name|itemprop|property)=(\"|\')(robots|googlebot|keywords|copyright|title|referrer|author|google-site-verification|msvalidate.*|twitter:.*|description|name|image|og:.*|article:.*|fb:.*)(\"|\').*?>\s*?\n//gi;" **/*.html; perl -pi -w -e "s/<link rel=(\"|\')(canonical|shortlink)(\"|\').*?>\s*?\n//i;" **/*.html; perl -pi -w -e "s/<script type=(\"|\')application\/ld\+json(\"|\')>.*?<\/script>\s*?\n//i;" **/*.html;

6. Update URLs


This will scour your imported HTML files and replace any hard-coded links to the main domain with the Jekyll variable {{ site.url }}, then rewrite relative links ending in .html/.htm/.php into extension-less, trailing-slash URLs.

Commands:

  1. perl -pi -w -e 's/https?:\/\/(www.)?[DOMAIN]/{{ site.url }}/gi;' **/*.html;
  2. perl -pi -w -e "s/href=(\"|\')(?!(http|www))(.*?)(\.html|\.htm|\.php)(.*?)(\"|\')/href=\$1\$3\/\$5\$6/gi;" **/*.html;

Combined Commands:

  1. perl -pi -w -e 's/https?:\/\/(www.)?[DOMAIN]/{{ site.url }}/gi;' **/*.html; perl -pi -w -e "s/href=(\"|\')(?!(http|www))(.*?)(\.html|\.htm|\.php)(.*?)(\"|\')/href=\$1\$3\/\$5\$6/gi;" **/*.html;

7. Code Cleanup


This will remove a few WordPress-related snippets of code, such as feed links, references to w.org, and leftover analytics and monitoring scripts (New Relic, Clicky, Performancing Metrics, and others).

Commands:

  1. perl -pi -w -e "s///g;" **/*.html;
  2. perl -pi -w -e "s/featured_image: \"\{\{ site\.url \}\}/featured_image: \"/g;" **/*.html;
  3. perl -pi -w -e "s/src=(\"|\')http:\/\//src=\$1https:\/\//g;" **/*.html;
  4. perl -pi -w -e "s/<link rel=(\"|\')alternate(\"|\') type=(\"|\')(application\/rss\+xml|text\/xml\+oembed|application\/json\+oembed)(\"|\').*?>\n//g;" **/*.html;
  5. perl -pi -w -e "s/<link rel=(\"|\')https:\/\/api\.w\.org\/(\"|\').*?>\n//g;" **/*.html;
  6. perl -pi -w -e "s/<link rel=(\"|\')dns-prefetch(\"|\') href='\/\/s\.w\.org(\"|\').*?>\n//g;" **/*.html;
  7. perl -pi -w -e "s/<script type=(\"|\')text\/javascript(\"|\')>\(?window\.NREUM\|\|\(NREUM=\{\}\).*?<\/script>(\n)?//g;" **/*.html;
  8. perl -pi -w -e "s/<script type=(\"|\')text\/javascript(\"|\')>try\{ clicky\.init.*?<\/script>(\n)?//g;" **/*.html;
  9. perl -pi -w -e "s/<script src=(\"|\')(https:|http:)?\/\/pmetrics\.performancing\.com\/js.*?<\/script>(\n)?//g;" **/*.html;
  10. perl -pi -w -e "s/<noscript><p><img alt=(\"|\')Performancing Metrics.*?<\/noscript>(\n)?//g;" **/*.html;
  11. perl -i -p0e "s/<script type=(\"|\')text\/javascript(\"|\')>\s*?_stq = window\._stq \|\| \[\]\;.*?<\/script>(\n)?//s;" **/*.html;
  12. perl -i -p0e "s/<script>\s*?var _prum = \[\[\'id\'.*?<\/script>(\n)?//s;" **/*.html;
  13. perl -i -p0e "s/<script type=(\"|\')text\/javascript(\"|\')>\s*?window\._wpemojiSettings.*?<\/style>(\n)?//s;" **/*.html;

Combined Commands:

  1. perl -pi -w -e "s///g;" **/*.html; perl -pi -w -e "s/featured_image: \"\{\{ site\.url \}\}/featured_image: \"/g;" **/*.html; perl -pi -w -e "s/src=(\"|\')http:\/\//src=\$1https:\/\//g;" **/*.html; perl -pi -w -e "s/<link rel=(\"|\')alternate(\"|\') type=(\"|\')(application\/rss\+xml|text\/xml\+oembed|application\/json\+oembed)(\"|\').*?>\n//g;" **/*.html; perl -pi -w -e "s/<link rel=(\"|\')https:\/\/api\.w\.org\/(\"|\').*?>\n//g;" **/*.html; perl -pi -w -e "s/<link rel=(\"|\')dns-prefetch(\"|\') href='\/\/s\.w\.org(\"|\').*?>\n//g;" **/*.html; perl -pi -w -e "s/<script type=(\"|\')text\/javascript(\"|\')>\(?window\.NREUM\|\|\(NREUM=\{\}\).*?<\/script>(\n)?//g;" **/*.html; perl -pi -w -e "s/<script type=(\"|\')text\/javascript(\"|\')>try\{ clicky\.init.*?<\/script>(\n)?//g;" **/*.html; perl -pi -w -e "s/<script src=(\"|\')(https:|http:)?\/\/pmetrics\.performancing\.com\/js.*?<\/script>(\n)?//g;" **/*.html; perl -pi -w -e "s/<noscript><p><img alt=(\"|\')Performancing Metrics.*?<\/noscript>(\n)?//g;" **/*.html; perl -i -p0e "s/<script type=(\"|\')text\/javascript(\"|\')>\s*?_stq = window\._stq \|\| \[\]\;.*?<\/script>(\n)?//s;" **/*.html; perl -i -p0e "s/<script>\s*?var _prum = \[\[\'id\'.*?<\/script>(\n)?//s;" **/*.html; perl -i -p0e "s/<script type=(\"|\')text\/javascript(\"|\')>\s*?window\._wpemojiSettings.*?<\/style>(\n)?//s;" **/*.html;

8. Invalid UTF-8 Character


Check for invalid UTF-8 characters; find and fix them before adding the _import files into your _collections directory.

Commands:

  1. for file in **/*.html; do
  2. grep -axv '.*' $file;
  3. done;

Combined Commands:

  1. for file in **/*.html; do grep -axv '.*' $file; done;
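If grep's locale handling makes the check unreliable on your system, iconv (an alternative tool, not the author's command) exits non-zero on invalid UTF-8 and can be used to flag bad files:

```shell
# Flag files containing byte sequences that are not valid UTF-8 (sample data).
workdir=$(mktemp -d); cd "$workdir"
printf 'all good\n' > good.html
printf 'bad \xc3\x28 sequence\n' > bad.html   # 0xC3 0x28 is not valid UTF-8
bad_files=""
for file in *.html; do
  if ! iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
    bad_files="$bad_files $file"
  fi
done
echo "invalid:$bad_files"
```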

9. Integrate the imported static markdown files with your new stack. :)