class:inverse middle center # Preprocessing FASTQ files ---- ### Making them read-only, and concatenating files from different lanes <br> <br> <br> <br> <br> ### Jelmer Poelstra, MCIC Wooster ### 2021/02/17 (updated: 2021-02-16) --- class: inverse middle center # Making raw data as read-only: # Viewing and modifying file permissions ---- <br><br><br> --- ## Showing file permissions - To show file permissions, use **`ls`** with the **`-l`** (*long* format) option.shows file permissions. (The command below also uses the **`-a`** option to show all files, including hidden ones, and **`-h`** shows file sizes in "human-readable" format.) <p align="center"> <img src=img/long-ls.png width="700"> </p> --- ## File permission notation in `ls -l` <br> <p align="center"> <img src=img/permissions.svg width="50%"> </p> --- ## Changing file permissions This can be done in several ways with the `chmod` command: ```sh # chmod <who>=<permission-to-set>`: $ chmod a=rwx # a: all / u: user / g: group / o: others # chmod <who>+<permission-to-add>: $ chmod a+r # chmod <who>-<permission-to-remove>: $ chmod o-w ``` -- - Using numbers: .pull-left[ | Nr | Permission | |----|----------------| | 1 | e**x**ecute | | 2 | **w**rite | | 4 | **r**ead | | 5 | r + x | | 6 | r + w | | 7 | r + w + x | ] .pull-right[ Three numbers are needed to specify permissions for user, group, and others: ```sh # rwx permissions for all: $ chmod 777 file.txt # rwx for me, r for others: $ chmod 744 file.txt ``` ] --- ## File permissions – Making your data read-only - **To make files read-only:** ```sh $ chmod a=r data/raw/* # *a*ll = *r*ead $ chmod u-w data/raw/* # Remove write permission for *u*ser $ chmod 400 data/raw/* # User: read // group & others: none ``` - To operate on directories and all their contents, you need the `-R` flag. (If you do this, you can't create new files in these directories either.) ```sh $ chmod -R a=r data/raw ``` <br> .content-box-info[ As the owner of the file, you can always change file permission back. ] --- ##
Changing file permissions: In practice - First, let's create a file and look at its original permissions: ```sh $ touch testfile $ ls -l testfile #> -rw-r--r-- 1 jelmer PAS0471 0 Feb 15 16:42 testfile ``` - Then, limit permissions for **a**ll to **r**ead, and check what we did: ```sh $ chmod a=r testfile $ ls -l testfile #> -r--r--r-- 1 jelmer PAS0471 0 Feb 15 16:42 testfile ``` --- ##
Changing file permissions: In practice - What happens when we try to overwrite write-protected files? ```sh $ echo "blabla" > testfile #> bash: testfile: Permission denied ``` - What happens when we try to remove write-protected files? ```sh # Answer "y"/"yes" to remove, "n"/"no" to cancel: $ rm testfile #> rm: remove write-protected regular empty file ‘testfile’? # With the -f flag, the file will be removed without prompting: $ rm -f testfile ``` .content-box-info[ While *you* (the file owner) can still remove write-protected files, others really can't read/write/execute files without appropriate permissions. ] --- class: inverse middle center # Concatenating files from different lanes ---- <br><br><br> --- ## Concatenating files from different lanes - For each sample, we have 4 files: 2 read directions (`R1` and `R2`), and 2 lanes (`L001` and `L002`). ```sh $ tree data/fastq #>data/fastq #>├── md5.txt #> ├── nreads.txt #> ├── README.txt #> ├── X6465_Benitez-PonceM_C6_myb2_V1N_1 #> │ ├── checksum_files.md5 #> │ ├── X6465_Benitez-PonceM_C6_myb2_V1N_1_S1_L001_R1_001.fastq.gz #> │ ├── X6465_Benitez-PonceM_C6_myb2_V1N_1_S1_L001_R2_001.fastq.gz #> │ ├── X6465_Benitez-PonceM_C6_myb2_V1N_1_S1_L002_R1_001.fastq.gz #> │ └── X6465_Benitez-PonceM_C6_myb2_V1N_1_S1_L002_R2_001.fastq.gz #> ├── X6466_Benitez-PonceM_C6_myb3_V1N_1 #> │ ├── checksum_files.md5 #> │ ├── X6466_Benitez-PonceM_C6_myb3_V1N_1_S2_L001_R1_001.fastq.gz #> │ ├── X6466_Benitez-PonceM_C6_myb3_V1N_1_S2_L001_R2_001.fastq.gz #> │ ├── X6466_Benitez-PonceM_C6_myb3_V1N_1_S2_L002_R1_001.fastq.gz #> │ └── X6466_Benitez-PonceM_C6_myb3_V1N_1_S2_L002_R2_001.fastq.gz ``` --- ## Concatenating files from different lanes (cont.) Since our end goal is to quantify gene expression on a per-sample basis, we can concatenate the files from the two different lanes. (After mapping, we will also combine R1 and R2, so we will go down to 1 file per sample.) This is extremely simple – to concatenate two gzipped FASTQ file, we can use `cat`: ```sh $ cat file1.fastq.gz file2.fastq.gz > concat.fastq.gz ``` --- ## Concatenating files from different lanes (cont.) This also gives us an opportunity to get rid of the non-informative parts of our file names – for instance, only `C6_myb3` is the sample name in these file names: ```sh X6466_Benitez-PonceM_C6_myb3_V1N_1_S2_L001_R1_001.fastq.gz X6466_Benitez-PonceM_C6_myb3_V1N_1_S2_L002_R1_001.fastq.gz ``` Therefore, we can name the concatenated file: ```sh C6_myb3_R1.fastq.gz ``` --- ## Code to concatenate all files ```sh outdir=results/fastq_concat for fastq_dir in data/fastq/X646[5-9]* data/fastq/X647[0-9]*; do for R in R1 R2; do L1=$(ls "$fastq_dir"/*_L001_"$R"_*fastq.gz) # File from lane 1 L2=${L1/_L001_/_L002_} # File from lane 2 sample_id=$(basename "$L1" | \ sed -E 's/X64[6-7][0-9]_Benitez-PonceM_(.*)_V1N.*fastq.gz/\1/') concat="$outdir"/"$sample_id"_"$R".fastq.gz # File with all seqs cat "$L1" "$L2" > "$concat" # Concatenate the L1 and L2 seqs ls -lh "$L1" # List the input and output files as a check ls -lh "$L2" ls -lh "$concat" echo "----" done done ``` --- ## Sample info | Sample ID | AMF | Treatment |-----------|-----------|------------------- | C6_myb2 | Ri | Agrobacterium_noexp | C6_myb3 | Ri | Agrobacterium_noexp | C6_myb4 | Ri | Agrobacterium_noexp | CT_05 | Ri | mock | CT_06 | Ri | mock | CT_09 | Ri | mock | T_26 | Ri | Agrobacterium_myb | T_29 | Ri | Agrobacterium_myb | T_30 | Ri | Agrobacterium_myb | cnt_1 | control | mock | cnt_2 | control | mock | cnt_3 | control | mock | myb_1 | control | Agrobacterium_myb | myb_2 | control | Agrobacterium_myb | myb_3 | control | Agrobacterium_myb --- ## Sample info (cont.) All samples are from tomato roots. | Column | descriptions |------------|----------------------------------- | **mock** | Mock *Agrobacterium* inoculation | **myb ** | Inoculated with *Agrobacterium*, expressing *myb* reporter | **no_exp** | Inoculated with *Agrobacterium*, root does not express *myb* | **Ri** | *Rhizogenes intraradices* inoculation and observed colonization | **control** | No AMF (Ri) inoculation