Scrapy Cheat Sheet
In Scrapy, you write a Spider that crawls a site and fetches the information you want. To create one, go to the spiders folder of your project and add a Python file there. First give the Spider a name by assigning it to the name attribute, then give the start URL from which the Spider will begin scraping.
The following cheat sheet for commonly used Python Turtle commands will get you up and running with Python Turtle quickly. Turtle is a fun program that dates all the way back to the 1960s when Seymour Papert and his colleagues at MIT created the programming language LOGO which could control a robot turtle with a physical pen in it. Today Turtle Graphics are most often associated with the Python programming language.
Python Turtle Commands
import turtle | Import the turtle library |
turtle_obj = turtle.Turtle() | Creates a new Turtle object and opens its window. |
turtle_obj.home() | Moves turtle_obj to the center of the window and then points turtle_obj east. |
turtle_obj.up() | Raises turtle_obj’s pen from the drawing surface. |
turtle_obj.down() | Lowers turtle_obj’s pen to the drawing surface. |
turtle_obj.setheading(degrees) | Points turtle_obj in the indicated direction, which is specified in degrees. East is 0 degrees, north is 90 degrees, west is 180 degrees, and south is 270 degrees. |
turtle_obj.left(degrees) | Rotates turtle_obj to the left by the specified degrees. |
turtle_obj.right(degrees) | Rotates turtle_obj to the right by the specified degrees. |
turtle_obj.goto(x, y) | Moves turtle_obj to the specified position. |
turtle_obj.forward(distance) | Moves turtle_obj the specified distance in the current direction. |
turtle_obj.backward(distance) | Moves turtle_obj the specified distance in the reverse direction. |
turtle_obj.pencolor(r, g, b) | Changes the pen color of turtle_obj to the specified RGB value. |
turtle_obj.pencolor(string) | Changes the pen color of turtle_obj to the specified string, such as 'red'. Returns the current pen color of turtle_obj when the arguments are omitted. |
turtle_obj.fillcolor(r, g, b) | Changes the fill color of turtle_obj to the specified RGB value. |
turtle_obj.fillcolor(string) | Changes the fill color of turtle_obj to the specified string, such as 'red'. Returns the current fill color of turtle_obj when the arguments are omitted. |
turtle_obj.begin_fill() | Call just before drawing the shape to be filled. |
turtle_obj.end_fill() | Fills the shape drawn since the last call to begin_fill(), using the current fill color. |
turtle_obj.clear() | Erases all of the turtle’s drawings, without changing the turtle’s state. |
turtle_obj.width(pixels) | Changes the pen width of turtle_obj to the specified number of pixels. Returns turtle_obj’s current pen width when the argument is omitted. |
turtle_obj.hideturtle() | Makes the turtle invisible. |
turtle_obj.showturtle() | Makes the turtle visible. |
turtle_obj.position() | Returns the current position (x, y) of turtle_obj. |
turtle_obj.heading() | Returns the current direction of turtle_obj. |
turtle_obj.isdown() | Returns True if turtle_obj’s pen is down or False otherwise. |
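The commands in the table can be combined as in the following minimal sketch, which draws a filled square. The helper names square_moves and draw_filled_square are hypothetical, and the turtle module is imported inside the drawing function so that the planning helper can be used on a machine without a display.

```python
def square_moves(side):
    """Return the (command, value) pairs that trace a square of the given side."""
    moves = []
    for _ in range(4):
        moves.append(("forward", side))  # draw one edge
        moves.append(("left", 90))       # turn at the corner
    return moves


def draw_filled_square(side=100, color="red"):
    """Open a turtle window and draw a filled square (needs a graphical display)."""
    import turtle  # imported here so square_moves works without a display

    turtle_obj = turtle.Turtle()
    turtle_obj.fillcolor(color)
    turtle_obj.begin_fill()
    for command, value in square_moves(side):
        # Look up turtle_obj.forward / turtle_obj.left by name and call it.
        getattr(turtle_obj, command)(value)
    turtle_obj.end_fill()
    turtle.done()  # keep the window open until it is closed by hand
```

Calling draw_filled_square() opens the window and draws the shape; square_moves(50) only returns the list of steps.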
The most important bash commands for managing processes, Git, Python, R, SQL/SQLite and LaTeX, for researchers and data scientists. This cheat sheet focuses only on bash commands run from the terminal.
Table of Contents
- Managing processes
  - First-aid procedure for killing a running process
- Git
  - Clone repository from GitHub to local machine
- Python
  - Virtual environments
- R
  - Open new window
- SQL and SQLite
  - Repair corrupt database
- Text editing and LaTeX
  - Calculate the number of words in a Latex file
Managing processes
First-aid procedure for killing a running process
- open a new terminal window
- type ps and press Enter
- identify the PID (process ID) of the process
- type kill -9 <PID>
OR: press Control + C (twice if needed)
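The same first-aid step can be sketched in Python. This is only an illustration under two assumptions: a POSIX system, and a stand-in child process started by the script itself in place of a real runaway process whose PID you would read from ps.

```python
import os
import signal
import subprocess
import sys

# Start a stand-in long-running process (hypothetical; a PID taken from ps works the same way).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("PID:", child.pid)

# Equivalent of: kill -9 <PID>
os.kill(child.pid, signal.SIGKILL)
child.wait()

# On POSIX, a negative return code is the number of the signal that ended the process.
print("return code:", child.returncode)
```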
Cronjobs and Crontab
Schedule crontab task
- If you want to run the cronjob on a server: log in to the server
- enter crontab -e in the terminal
- enter a line of the form <minutes> <hours> <day of month> <month> <day of week> <command>
- for example: 6 0 * * 1-6 cd /home/annerose/Python/continuousscraper/ && python processcontrol.py
- this means that the process will start Monday through Saturday at 6 minutes past midnight.
For more information, see
- http://www.everydaylinuxuser.com/2014/10/an-everyday-linux-user-guide-to.html
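To make the field order of a crontab line concrete, here is a small sketch that splits a line into its five time fields and the command. The function name split_crontab_line is hypothetical, and the sketch only labels the fields; it does not validate them or expand ranges such as 1-6.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_crontab_line(line):
    """Split a crontab line into a dict of its five time fields and the command."""
    # Split on whitespace at most five times so the command keeps its own spaces.
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five time fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]


schedule, command = split_crontab_line(
    "6 0 * * 1-6 cd /home/annerose/Python/continuousscraper/ && python processcontrol.py"
)
print(schedule)
print(command)
```

For the example line, the minute field is 6, the hour field is 0, and the day-of-week field is 1-6 (Monday through Saturday).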
Kill an existing cronjob
- enter ps -e in the terminal to see all running processes.
- determine which process ID your process has.
- enter kill -9 <processid>
Tmux sessions
Tmux lets you keep processes running after an ssh session has ended.
- ssh into the remote machine
- start tmux by typing tmux into the shell
- start the process you want inside the tmux session
- leave/detach the tmux session by typing Ctrl+B and then D
You can now safely log off from the remote machine; your process will keep running inside tmux. When you come back and want to check the status of your process, use tmux attach to attach to your tmux session.
If you want to have multiple sessions running side by side, name each session using Ctrl+B and $. You can get a list of the currently running sessions using tmux list-sessions.
Some more useful tmux commands:
Command | Significance |
---|---|
Ctrl+B <key> | prefix that tells tmux the next key is a tmux command and not input for the shell |
Ctrl+B p | previous window |
Ctrl+B n | next window |
Ctrl+B c | create window |
Ctrl+B w | list windows |
Ctrl+B % | split window vertically into two parts (left and right) |
Ctrl+B " | split window horizontally (top and bottom) |
tmux new -s <sessionname> | create a new tmux session |
Ctrl+B x | close (kill) tmux pane |
Ctrl+B d | detach from tmux session (without stopping the process) |
tmux list-sessions | list all tmux sessions |
tmux attach -t <sessionname> | attach to a certain tmux session |
tmux attach | attach to the most recently used tmux session |
Bash profiles
Create bash profile
touch creates the file, so there is no need to run this command when the file already exists.
To edit the .bash_profile, open it in a text editor.
Git
Clone repository from GitHub to local machine
- create new repository on GitHub
- go to the directory on your local machine where the cloned repository should be saved.
- type git clone https://github.com/your-name/repository-name.git
- the repository should now appear in the local folder on your machine.
Commit file from terminal
- go to the directory of your repository inside the terminal
- type git add . to stage your changes. This recurses into sub-directories. Alternatives: git add <path> for specific files, or git commit -a to stage and commit all tracked changes in one step.
- type git commit -m "your commit message" to commit the changes.
- type git push to push the changes.
To see the status of your repository: git status
Managing branches
Branches are very important when you work collaboratively on GitHub.
- go to the directory of your repository inside the terminal
- before creating a new branch, make sure all changes are pulled to your local repository
- Create a new branch by typing git checkout -b [name_of_your_new_branch]
- Push the new branch to GitHub by typing git push origin [name_of_your_new_branch]
- Check which branches exist for this repository: git branch. (If there is only the master branch, it will return * master.)
- Add a new remote for your branch: git remote add [name_of_your_remote] [url_of_your_remote]. A remote (URL) is Git’s fancy way of saying “the place where your code is stored.”
- Push changes from your commit into your branch (= into your remote): git push [name_of_your_new_remote] [name_of_your_branch]
- Fetch updates from the remote: git fetch [name_of_your_remote]. This downloads the changes without merging them into your branch.
- To merge changes between your branch and the original (master) branch, first switch to the master branch in your terminal: git checkout master. Then simply type git merge [name_of_your_branch].
global .gitignore file
A global .gitignore file lists the file types to be excluded from every git project.
The file is found under Documents/Username (as a hidden file). Open it in a text editor to edit it and add the files you don’t want to sync with git/GitHub.
local .gitignore file
In the terminal, go to the working directory of the project you want to commit to github.
The file is found locally in the working environment of the project. Open it in a text editor to edit it and add files.
How to prevent conflicts in a collaborative Github project
The following procedure should help you considerably to prevent conflicts in a collaborative Git and GitHub project.
Before you start working: pull
Once you’ve made any changes to the project:
- Commit
- Pull
- If you get an error message, clean the file, solve conflicts
- Push
To summarize: pull, commit, pull, clean, push
Solve conflict using VIM editor
See this Stackoverflow post: http://stackoverflow.com/questions/5599122/problems-with-entering-git-commit-message-with-vim
If there is a conflict between your local version of the project and the version on Github, a window of the VIM editor will open after you’ve tried to commit your local changes. In this case, you should proceed as follows:
- type i in the VIM editor, which switches to editing (“insert”) mode
- type your merge message
- press Esc to make sure you have left insert mode
- then type :wq followed by Enter, which writes the current file and then closes it.
- your merge should now have been accepted.
Push commits from terminal with two-factor authentication
See this helpful page on how to push commits from the terminal when using two-factor authentication on GitHub:
https://gist.github.com/wikimatze/9790374
Important: You need to use your personal access token, not your Github password to push commits from the terminal.
Python
Virtual environments
Change virtual environment:
How to set up and manage virtual environments in Ubuntu: http://askubuntu.com/questions/244641/how-to-set-up-and-use-a-virtual-python-environment-in-ubuntu
Configure Pycharm to use a virtual environment
Set the shell under Preferences->Tools->Terminal->Shell path to /bin/bash --rcfile ~/.pycharmrc
Check which python packages are installed
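One way to list the installed packages from within Python is the standard-library importlib.metadata module (Python 3.8 and later). This is a sketch rather than the command originally given here; the function name installed_packages is hypothetical, and the result depends on the environment in which the interpreter runs.

```python
from importlib import metadata


def installed_packages():
    """Return sorted (name, version) pairs for the distributions visible to this interpreter."""
    pairs = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            pairs.add((name, dist.version))
    return sorted(pairs, key=lambda pair: pair[0].lower())


for name, version in installed_packages():
    print(f"{name}=={version}")
```

Run it inside an activated virtual environment to see only the packages of that environment.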
Start scrapy project
Start a scrapy project for webscraping: enter scrapy startproject <project_name> in the terminal (in the directory where you want to start your project).
R
Open new window
- Open a new RStudio window from the terminal (e.g. when one RStudio needs to run for an extended period of time): enter open -n -a 'rstudio' in the terminal
- How to add an RStudio project to GitHub: https://www.r-bloggers.com/rstudio-and-github/
Add R project to Github
Add the following commands in shell after having created the project in Github:
Markdown and R
Render/compile an R Markdown file from Terminal:
This resource on R Markdown is helpful.
An R Markdown cheatsheet is available from RStudio here.
Options settings
Set options, even options that aren’t defined by default. This can be useful for example for setting your consumer key, consumer secret etc. of your Twitter app:
SQL and SQLite
Repair corrupt database
How to repair db database: see stackoverflow
Merge two SQLite databases
Leaving out duplicates:
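One possible approach, sketched with Python's built-in sqlite3 module, is to attach the second database and copy its rows with INSERT OR IGNORE. The assumptions are loud ones: both databases contain a table with the same schema, "duplicate" means a row that violates a PRIMARY KEY or UNIQUE constraint, and the function name merge_sqlite, the table name items and the temporary file names are all hypothetical.

```python
import os
import sqlite3
import tempfile


def merge_sqlite(target_path, source_path, table):
    """Copy rows of `table` from source into target, skipping rows that violate a unique key."""
    conn = sqlite3.connect(target_path)
    try:
        conn.execute("ATTACH DATABASE ? AS src", (source_path,))
        # The table name is interpolated into the SQL, so it must come from a trusted source.
        conn.execute(f'INSERT OR IGNORE INTO main."{table}" SELECT * FROM src."{table}"')
        conn.commit()
        conn.execute("DETACH DATABASE src")
    finally:
        conn.close()


def _make_db(path, rows):
    """Create a small demo database (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.db")
    second = os.path.join(tmp, "second.db")
    _make_db(first, [(1, "a"), (2, "b")])
    _make_db(second, [(2, "b"), (3, "c")])
    merge_sqlite(first, second, "items")
    conn = sqlite3.connect(first)
    merged_rows = conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
    conn.close()

print(merged_rows)
```

Tables without a primary key or unique constraint would keep their duplicates with this approach, so it is worth backing up the target database before merging.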
Open new SQLiteBrowser window from terminal
SQLiteBrowser is well suited for viewing and editing database files compatible with SQLite.
If you want to view several databases side by side, you have to open a new SQLiteBrowser window from the terminal (it doesn’t seem to be possible to open a new window from within SQLiteBrowser). To this end, go to the directory where your applications are stored (on a Mac). Normally, this should be:
Thereafter, type the command to open a new SQLiteBrowser window:
Text editing and LaTeX
Calculate the number of words in a Latex file
- Change the working directory of your terminal to where the LaTeX file is located.
- I use one of two options: (1) detex or (2) texcount
detex:
Enter detex <document_name>.tex | wc -w -c -l or just detex <document_name>.tex | wc
To calculate the word count of a PDF document: pdftotext <document_name>.pdf - | wc -w
texcount:
Enter texcount -1 <document_name>.tex
texcount has many options. For example, to include the bibliography in the word count, use texcount -1 -incbib <document_name>.tex
To include several documents in the word count (e.g. main paper and appendix), list the documents one after another: texcount -1 -incbib <main_document>.tex <appendix>.tex
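To illustrate what a detex-style word count does, here is a rough Python sketch that strips comments and control sequences before counting. It is a simplification and not a replacement for detex or texcount: the function name rough_latex_word_count is hypothetical, and environment names, math and option arguments are all counted as ordinary words.

```python
import re


def rough_latex_word_count(tex_source):
    """Approximate the word count of LaTeX source by stripping markup first."""
    # Drop comments: an unescaped % up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", tex_source)
    # Drop control sequences such as \section or \textbf, keeping their brace contents.
    text = re.sub(r"\\[A-Za-z]+\*?", " ", text)
    # Drop braces, brackets and any remaining backslashes.
    text = re.sub(r"[{}\[\]\\]", " ", text)
    return len(text.split())


sample = r"\section{Intro} This is \textbf{bold} text. % a comment"
print(rough_latex_word_count(sample))
```

For the sample line, the counted words are Intro, This, is, bold and text.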