Preserving websites
ToDo 9/7
test case on one or more of the following websites:
- CEST website (wiki CMS)
later:
- DCA & packed website (proprietary CMS)
- SCART website (drupal CMS)
Different ways of making a static archive of a Drupal website: https://www.drupal.org/node/27882
Goal
Take websites offline and place them in a repository in a form in which they can be preserved in the long term without the content being lost.
Guidelines
Preamble
- archiving a website should be done before the website goes offline, so you can perform checks to verify that the archived copy represents the original online version
- this also allows you to identify external services that are essential for rendering the website, so you can decide the scope of the crawl
- basically, the objective is to crawl the complete content of the website, including the external services
- basically, this means taking a snapshot of a website
- web archiving means accepting gaps
- web archiving means reconstructing/mimicking things
1. document the technical environment
minimum
- the server configuration the website was running on (e.g. PHP and MySQL versions; see the sketch after this list)
- technical metadata (e.g. HTML, CSS and JavaScript versions)
- the CMS or wiki software used to build the website (Drupal, MediaWiki, Joomla)
recommended
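A minimal sketch of how part of this technical environment could be documented while the site is still online, assuming Python with the requests library; the URL and the output file name are placeholders:

```python
# Sketch: capture response headers that hint at the server environment and CMS,
# while the website is still online. The URL and output file are placeholders.
import json
import requests

url = "http://www.example.org/"  # placeholder for the website to document
response = requests.get(url, timeout=30)

environment = {
    "url": url,
    "status": response.status_code,
    # Typical headers that reveal the server stack or CMS, if the server sends them.
    "server": response.headers.get("Server"),               # e.g. Apache/2.4, nginx
    "x_powered_by": response.headers.get("X-Powered-By"),   # e.g. PHP/7.4
    "x_generator": response.headers.get("X-Generator"),     # e.g. Drupal 7
    "content_type": response.headers.get("Content-Type"),
}

# Store alongside the archive as documentation of the technical environment.
with open("technical-environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```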
2. record the original structure of the website.
minimum
- keep the links as they were (the original URL naming) and associate them with another URL to preserve their function
- keep the files in the order in which they are placed in the folders
recommended
- use the WARC format? (see the sketch below)
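If the crawl is stored in the WARC format, the original URLs are kept inside the records and can be checked afterwards. A minimal sketch, assuming Python with the warcio library and a placeholder file name:

```python
# Sketch: list the original URLs preserved in a WARC file, using the warcio library.
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:  # placeholder file name
    for record in ArchiveIterator(stream):
        # 'response' records hold the harvested pages; the WARC-Target-URI header
        # keeps the original URL exactly as it was crawled.
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
```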
3. record the context of the website
minimum
- related external pages (links to other websites/web documents)
- related external content (YouTube, MediaWiki)
- related external services/scripts (see the sketch below)
recommended
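A minimal sketch of how the external pages, embedded content and scripts referenced by a single page could be listed, to help decide which external services fall within the crawl scope; it uses only the Python standard library and a placeholder URL:

```python
# Sketch: list external links, embeds and scripts referenced by one page,
# to help decide which external services belong inside the crawl scope.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

page_url = "http://www.example.org/"  # placeholder


class ExternalRefParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.external = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        ref = attrs.get("href") or attrs.get("src")
        if ref:
            absolute = urljoin(self.base_url, ref)
            # Anything pointing outside the site's own host is an external dependency.
            if urlparse(absolute).netloc not in ("", self.base_host):
                self.external.add(absolute)


html = urlopen(page_url).read().decode("utf-8", errors="replace")
parser = ExternalRefParser(page_url)
parser.feed(html)
for ref in sorted(parser.external):
    print(ref)  # external pages, embedded content (e.g. YouTube) and scripts
```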
4. record the evolution of the website
web inventory guideline
What is the main URL of the website? Which URL should you use to archive the website?
• Archive every URL of one website
• Some URLs might only mirror a part of the website and not all of it (e.g. the DCA website).
- describe the content of the website
- decide the extent of the crawl (physical scope)
- decide how often you want to crawl (temporal scope)
- decide to which extent you preserve the look & feel
- do you only archive an original copy?
- do you archive both a preservation and an access file?
How to deal with changing web standards:
- document which standards and browsers apply
- emulate the environment
- make static versions
ARC files
research projects
- LiWA
- ARCOMEM
static HTML
dynamic pages
- scripts/programs
- endless pages
Tools
There are a number of tools for archiving websites. Depending on the website, some approaches or tools will be more efficient than others.
HTTrack is a good tool for making a clone/offline version of a website. The only point that could pose a problem with this tool is that, in order to have working links, it renames all the files and local URLs to enable local navigation. Therefore, the original page "names" are not kept. This can be a problem for archiving, but not for access.
http://www.httrack.com/page/1/en/index.html
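A minimal sketch of a basic HTTrack run started from Python; HTTrack itself must be installed separately, and the URL and output directory are placeholders (see the HTTrack documentation for the full set of options):

```python
# Sketch: mirror a website with HTTrack from Python. HTTrack must be installed
# separately; the URL and output directory are placeholders.
import subprocess

subprocess.run(
    [
        "httrack",
        "http://www.example.org/",   # website to mirror
        "-O", "mirror/example.org",  # output directory for the local copy
    ],
    check=True,
)
```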
There are, however, alternatives to HTTrack, such as the Heritrix crawler, with which you can configure what you want to archive and which can store the archive in formats such as ARC and WARC. While you can configure more parameters, it is less user-friendly. It is well known because the Internet Archive uses this tool.
https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
There is also free software for navigating WARC archives: http://archive-access.sourceforge.net/projects/wera/
GNU Wget is another free and open source tool for making copies of websites. It is also configurable and more user-friendly than Heritrix. Unlike Heritrix, which has a graphical web interface, Wget is used from the command line.
http://www.gnu.org/software/wget/
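Unlike HTTrack, Wget can keep the original URLs by writing a WARC file while it mirrors a site. A minimal sketch started from Python, with a placeholder URL and file name:

```python
# Sketch: mirror a site with GNU Wget and write a WARC file at the same time.
# Wget must be installed separately; the URL and names are placeholders.
import subprocess

subprocess.run(
    [
        "wget",
        "--mirror",               # recursive download with timestamping
        "--page-requisites",      # also fetch images, CSS and scripts needed to render pages
        "--convert-links",        # rewrite links in the local copy for offline browsing
        "--warc-file=example",    # additionally write example.warc.gz with the original URLs
        "http://www.example.org/",
    ],
    check=True,
)
```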
On Mac OS X there is also a tool for easily making local copies of online websites, called SiteSucker. It has limitations as to which tags and which file formats it can crawl. A list of these limitations is available on the software's website.
http://sitesucker.us/mac/mac.html
http://warcreate.com/
This tool is an extension for your browser that allows you to archive one page at a time in the WARC format. Because it only archives one page at a time, it is not a practical tool for archiving an entire website.
Grab Site
Grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling.
https://github.com/ludios/grab-site
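A minimal sketch of starting a grab-site crawl from Python; grab-site must be installed separately (and may require its dashboard server to be running, see the README), and the URL is a placeholder:

```python
# Sketch: start a grab-site crawl that writes WARC files for one site.
# grab-site must be installed separately; the URL is a placeholder.
import subprocess

subprocess.run(["grab-site", "http://www.example.org/"], check=True)
```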
Webrecorder and webarchiveplayer (Emanuel)
Database Archiving (Emanuel)
Database to XML tool (see the sketch after this list)
http://deeparc.sourceforge.net/
https://github.com/soheilpro/Xinq
SIARD1 and SIARD2 standards
KEEP Solutions import and export tools for databases
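The tools and standards above are not illustrated here, but the basic database-to-XML idea can be sketched with the Python standard library; the database, table and output names are placeholders, and real database archiving would follow SIARD or use the tools listed above:

```python
# Sketch: dump one table of an SQLite database to XML, as an illustration of the
# database-to-XML idea behind tools such as DeepArc or the SIARD format.
# Database, table and output names are placeholders.
import sqlite3
import xml.etree.ElementTree as ET

connection = sqlite3.connect("website.db")          # placeholder database
cursor = connection.execute("SELECT * FROM pages")  # placeholder table
columns = [description[0] for description in cursor.description]

table = ET.Element("table", name="pages")
for row in cursor:
    row_element = ET.SubElement(table, "row")
    for column, value in zip(columns, row):
        cell = ET.SubElement(row_element, "cell", column=column)
        cell.text = "" if value is None else str(value)

ET.ElementTree(table).write("pages.xml", encoding="utf-8", xml_declaration=True)
connection.close()
```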
Archive Facebook (Check if it still works and compare the results with webrecorder)
https://addons.mozilla.org/en-US/firefox/addon/archivefacebook/
An extension to your browser that allows you to make a local archive of a Facebook account.
Twitter archiving (Emanuel / Does it still work?)
"twarc is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object that is exactly what was returned from the Twitter API. Tweets are stored as line-oriented JSON. Twarc runs in three modes: search, filter stream and hydrate. When running in each mode twarc will stop and resume activity in order to work within the Twitter API's rate limits."
https://github.com/edsu/twarc
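A minimal sketch of twarc's search mode used as a Python library, assuming the twarc 1.x interface; the API credentials and the search term are placeholders:

```python
# Sketch: archive tweets matching a search term as line-oriented JSON with twarc.
# The API credentials and the search term are placeholders; Twitter API keys are required.
import json
from twarc import Twarc

t = Twarc(
    "CONSUMER_KEY", "CONSUMER_SECRET",      # placeholder credentials
    "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",
)

with open("tweets.jsonl", "w") as archive:
    for tweet in t.search("#webarchiving"):  # placeholder search term
        # Each tweet is the JSON object returned by the Twitter API, one per line.
        archive.write(json.dumps(tweet) + "\n")
```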
Memento protocol (Emanuel)
http://www.dlib.org/dlib/september12/09inbrief.html
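A minimal sketch of a Memento TimeGate request with Python's requests library, assuming the Time Travel aggregator as the TimeGate; the endpoint, target URL and date are examples:

```python
# Sketch: ask a Memento TimeGate for an archived copy of a URL close to a given date.
# The TimeGate endpoint and the target URL are examples, not fixed values.
import requests

target = "http://www.example.org/"
timegate = "http://timetravel.mementoweb.org/timegate/" + target

response = requests.get(
    timegate,
    # The Accept-Datetime header is how the Memento protocol expresses "as it was then".
    headers={"Accept-Datetime": "Thu, 01 Jan 2015 00:00:00 GMT"},
)

# After redirection, the response should be a memento of the page;
# Memento-Datetime says when that copy was captured.
print(response.url)
print(response.headers.get("Memento-Datetime"))
```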
Access to archived website (Emanuel)
http://oldweb.today/
A project by Ilya Kreymer, supported by Rhizome, that allows you to browse websites in old versions of browsers, or in browsers that no longer exist, such as Netscape.
This website provides a good list of tools: http://netpreserve.org/web-archiving/tools-and-software
Inventorying web pages
better: web publications instead of websites
Goal
Collect all web publications of a particular organisation or person, place or subject within a certain period.
Central question the inventory must answer: what has someone put on the web about something at a certain moment, regardless of whether this was via a website, blog or social network? The search result should be a series of links to web publications. If those web publications are no longer online, they must be retrieved from a digital repository. How to preserve those publications in a repository is the subject of the other guideline.
- develop a good search strategy > which elements should be part of the search strategy
- document the search strategy together with the results > in which format should the archive and the metadata be preserved
- description of the web archive > standards for metadata about web resources
- how do you inventory the dynamic content of a web publication?
- do you make the inventory of your web publications searchable?
- how do you inventory the development of a publication over time?
Search strategies:
- depth/breadth first - popularity ranks - topical crawling
see LiWA, ARCOMEM, Apache Nutch, Heritrix, the UK Web Archive, the Portuguese Web Archive, PADICAT
guessing links, extracting parameters from the program code, executing JavaScript > simulating user activities
Crawl strategies (a minimal breadth-first sketch follows below):
1. depth-first (a sequence of dives into the depth of the page hierarchy)
2. breadth-first (level by level lower in the hierarchy)
3. select pages by popularity (based on PageRank)
4. content-based selection
topical crawling
focused on events and rarely on entities, based on the intention of the researcher; PageRank and semantics for prioritizing pages
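A minimal sketch of the breadth-first strategy (level by level) using only the Python standard library; the start URL and depth limit are placeholders, and a real crawler would also respect robots.txt and rate limits:

```python
# Sketch: breadth-first crawl of one site, level by level, up to a depth limit.
# The start URL and depth limit are placeholders; a real crawler would also honour
# robots.txt, rate limits and content-type checks.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

START_URL = "http://www.example.org/"
MAX_DEPTH = 2
HOST = urlparse(START_URL).netloc


class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


seen = {START_URL}
queue = deque([(START_URL, 0)])  # (url, depth): a FIFO queue makes this breadth-first

while queue:
    url, depth = queue.popleft()
    try:
        html = urlopen(url).read().decode("utf-8", errors="replace")
    except OSError:
        continue  # unreachable page: accept the gap and move on
    print(depth, url)
    if depth >= MAX_DEPTH:
        continue
    parser = LinkParser()
    parser.feed(html)
    for href in parser.links:
        absolute = urljoin(url, href)
        # Stay within the site (physical scope) and avoid revisiting pages.
        if urlparse(absolute).netloc == HOST and absolute not in seen:
            seen.add(absolute)
            queue.append((absolute, depth + 1))
```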