Autonomía digital y tecnológica

Código e ideas para una internet distribuida

Linkoteca. sysadmin


I used wget, which is available on any linux-ish system (I ran it on the same Ubuntu server that hosts the sites).

wget –mirror -p –html-extension –convert-links -e robots=off -P . http://url-to-site

That command doesn’t throttle the requests, so it could cause problems if the server has high load. Here’s what that line does:

–mirror: turns on recursion etc… rather than just downloading the single file at the root of the URL, it’ll now suck down the entire site.
-p: download all prerequisites (supporting media etc…) rather than just the html
–html-extension: this adds .html after the downloaded filename, to make sure it plays nicely on whatever system you’re going to view the archive on
–convert-links: rewrite the URLs in the downloaded html files, to point to the downloaded files rather than to the live site. this makes it nice and portable, with everything living in a self-contained directory.
-e robots=off: executes the «robots off» command, telling wget to ignore any directive to ignore the site in question. This is strictly Not a Good Thing To Do, but if you own the site, this is OK. If you don’t own the site being archived, you should obey all robots.txt files or you’ll be a Very Bad Person.
-P .: set the download directory to something. I left it at the default «.» (which means «here») but this is where you could pass in a directory path to tell wget to save the archived site. Handy, if you’re doing this on a regular basis (say, as a cron job or something…)
http://url-to-site: this is the full URL of the site to download. You’ll likely want to change this.

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive’s ARC_IA File Format[4] that has traditionally been used to store «web crawls» as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.

From the discussion about Working with ARCHIVE.ORG, we learn that it is important to save not just files but also HTTP headers.

To download a file and save the request and response data to a WARC file, run this:

wget «http://www.archiveteam.org/» –warc-file=»at»

This will download the file to index.html, but it will also create a file at-00000.warc.gz. This is a gzipped WARC file that contains the request and response headers (of the initial redirect and of the Wiki homepage) and the html data.

If you want to have an uncompressed WARC file, use the –no-warc-compression option:

wget «http://www.archiveteam.org/» –warc-file=»at» –no-warc-compression

When IA first started doing their thing, they came across a problem: how do you actually save all of the information related to a website as it existed at a point in time? IA wanted to capture it all, including headers, images, stylesheets, etc.

After a lot of revision the smart folks there built a specification for a file format named WARC, for Web ARCive. The details aren’t super important, but the gist is that it will preserve everything, including headers, in a verifiable, indexed, checksumed format.

I surf the web an awful lot, probably slightly more than your average 13 year old geek. I notice that a lot of sites load rather slowly mostly because your waiting on content from outside the specific domain. For example if you go to a website like thechive.com (one of my favorites) you will notice it takes quite a long time loading the ads. It would be nice if you could block advertisements… oh you can?

Although I mentioned thechive.com I spend most of my time on the net looking for information, not entertainment. These ads really hinder my search speed!

So here is a quick way you can block all the ads. Not only will your surfing be faster but you will also save some bandwidth.

First off I would like to thank the fine folks at http://winhelp2002.mvps.org/ for doing all the leg work and collecting all the data necessary for this to work.

PyPI packages not in the standard library:

  • virtualenv is a very popular tool that creates isolated Python environments for Python libraries. If you’re not familiar with this tool, I highly recommend learning it, as it is a very useful tool, and I’ll be making comparisons to it for the rest of this answer.

    It works by installing a bunch of files in a directory (eg: env/), and then modifying the PATH environment variable to prefix it with a custom bin directory (eg: env/bin/). An exact copy of the python or python3 binary is placed in this directory, but Python is programmed to look for libraries relative to its path first, in the environment directory. It’s not part of Python’s standard library, but is officially blessed by the PyPA (Python Packaging Authority). Once activated, you can install packages in the virtual environment using pip.

  • pyenv is used to isolate Python versions. For example, you may want to test your code against Python 2.7, 3.6, 3.7 and 3.8, so you’ll need a way to switch between them. Once activated, it prefixes the PATH environment variable with ~/.pyenv/shims, where there are special files matching the Python commands (python, pip). These are not copies of the Python-shipped commands; they are special scripts that decide on the fly which version of Python to run based on the PYENV_VERSION environment variable, or the .python-version file, or the ~/.pyenv/version file. pyenv also makes the process of downloading and installing multiple Python versions easier, using the command pyenv install.
  • pyenv-virtualenv is a plugin for pyenv by the same author as pyenv, to allow you to use pyenv and virtualenv at the same time conveniently. However, if you’re using Python 3.3 or later, pyenv-virtualenv will try to run python -m venv if it is available, instead of virtualenv. You can use virtualenv and pyenv together without pyenv-virtualenv, if you don’t want the convenience features.
  • virtualenvwrapper is a set of extensions to virtualenv (see docs). It gives you commands like mkvirtualenv, lssitepackages, and especially workon for switching between different virtualenv directories. This tool is especially useful if you want multiple virtualenv directories.
  • pyenv-virtualenvwrapper is a plugin for pyenv by the same author as pyenv, to conveniently integrate virtualenvwrapper into pyenv.
  • pipenv aims to combine Pipfile, pip and virtualenv into one command on the command-line. The virtualenv directory typically gets placed in ~/.local/share/virtualenvs/XXX, with XXX being a hash of the path of the project directory. This is different from virtualenv, where the directory is typically in the current working directory. pipenv is meant to be used when developing Python applications (as opposed to libraries). There are alternatives to pipenv, such as poetry, which I won’t list here since this question is only about the packages that are similarly named.

Standard library:

  • pyvenv is a script shipped with Python 3 but deprecated in Python 3.6 as it had problems (not to mention the confusing name). In Python 3.6+, the exact equivalent is python3 -m venv.
  • venv is a package shipped with Python 3, which you can run using python3 -m venv (although for some reason some distros separate it out into a separate distro package, such as python3-venv on Ubuntu/Debian). It serves the same purpose as virtualenv, but only has a subset of its features (see a comparison here). virtualenv continues to be more popular than venv, especially since the former supports both Python 2 and 3.

Recommendation for beginners:

This is my personal recommendation for beginners: start by learning virtualenv and pip, tools which work with both Python 2 and 3 and in a variety of situations, and pick up other tools once you start needing them.

The script would:
– extract all file versions to /tmp/all_versions_exported
– take 1 argument – relative path to the file inside git repo
– give result filenames numeric prefix (sortable)
– mention inspected filename in result files (to tell apples apart from oranges:)
– mention commit date in the result filename (see output example below)
– not create empty result files

1. Update the repository index by executing the below command in Terminal:
$ sudo apt-get update

2. Next, execute the below command to clean out the local repository:
$ sudo apt-get clean

3. Execute the below command to remove all the unnecessary packages that are no longer needed:
$ sudo apt-get autoremove

The above command will display the unmet dependencies or broken package’s name.

ncftpput -R -v -u «username» ftp.nixcraft.biz /nixcraft/forum /tmp/phpbb

Where,

-u «username» : Ftp server username
-v : Verbose i.e. show upload progress
-R : Recursive mode; copy whole directory trees.
ftp.nixcraft.biz : Remote ftp server (use FQDN or IP).
/nixcraft/forum : Remote ftp server directory where all files and subdirectories will be uploaded.
/tmp/phpbb : Local directory (or list of files) to upload remote ftp server directory /nixcraft/forum

So what is a RAID array? Being here you probably have enough interest in computers to have heard of RAID but unless you are slightly obsessed with hard drive technology you probably haven’t learned much about it. RAID has been, and to a large part still is the domain of higher-level servers.

RAID describes three main abilities that can be implemented either alone or in combination to best fit various scenarios. These features include «stripping», «mirroring» and «parity».

Stripping, known as RAID level 0 or RAID0 is the process of using two or more drives for simultaneous writing and reading. When a file is to be written to a stripped array the data is divided into chunks and written to the drives in the array at the same time. As a loose example you can take a 10MB file and write it to a RAID0 array with two drives in roughly the time it would normally take to write a 5MB file (twice the speed). The same 10MB file could be written to an array with five drives in roughly the time it would have taken a 2MB file to be written to a single drive (five times as fast). Calculating the actual speed benefits isn’t so cut and dry because of other overhead but you get a good idea.

Next up is «Mirroring» or RAID1. As its name implies, two drives are mirror images of one another. If one drive fails the data is safe thanks to the second identical drive. The down side is that 50% of the physical hard drive space is wasted.

Finally we get to «Parity», used in RAID3, 4, 5 and 6 but most popularly in RAID5. Remember in math class you asked «where will I ever use this in the real world?» Well my friends, Boolean algebra has allowed us a very efficient was to protect data. Lets use a RAID5 array for this example but first let me describe a RAID5 array.

In a RAID5 array you need a minimum of 3 disks. The more you add though the better performance you gain and the more efficiently you use your disk space. The trade off is you need an increasingly more powerful RAID controller and that translates to a higher cost. In a RAID5 array performance is increased by stripping data across the available drives (RAID0). In a RAID0 array though a single disk failure will destroy all the data because part of just about every file is on each disk. Parity is added in RAID5 to deal with this.

Lo que se debe hacer es actualizar un directorio en la variable PATH la cual posiblemente al actualizar la rama se pudo desconfigurar.

sudo export PATH=$PATH:/usr/local/sbin:/usr/sbin:/sbin

Y luego para que esta asignacion no se pierda al reinciar el equipo. Se debe configurar la variable PATH de forma permanente editando el archivo de configuración de su shell de conexión. Como por lo general el shell BASH es el más utilizado, debe editar su archivo:

echo 'export PATH=$PATH:/usr/local/sbin:/usr/sbin:/sbin' >> /home/usuario/.bashrc

To find all files in /home/user/demo directory, enter:

find /home/user/demo -type f -print

To find all files in /home/user/demo directory with permission 777, enter:

find /home/user/demo -type f -perm 777 -print

Finally, apply new permission using the -exec option as follows:

$ find /home/user/demo -type f -perm 777 -print -exec chmod 755 {} ;

To select directories and subdirectories use the following syntax:

$ find /var/www/html -type d -perm 777 -print -exec chmod 755 {} ;
#!/bin/bash

####
# This script automatically creates user accounts with random passwords.
#
# Author: Russ Sanderlin
# Date: 01/21/15
#
###

if [ $# -lt 1 ]; then
echo "Please supply a user name"
echo "Example: " $0 "jsmith"
exit
fi

# Declare local variables, generate random password.

newuser=$1
randompw=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 8 | head -n 1)

# Create new user and assign random password.

useradd $newuser
echo $newuser:$randompw | chpasswd
echo "UserID:" $newuser "has been created with the following password:" $randompw