Import various files, detect duplicates with sqlite, and reject image files by deep learning.
Dedupper imports large numbers of files while eliminating duplicates. It currently focuses on image files; video files can also be handled as-is (by file hash only).
Dedupper uses the current date and time to build the destination file path, so you do not have to worry about one directory being flooded with files.
```
choco install imagemagick --version 7.0.7.6 -y
choco install ffmpeg --version 3.4.2 -y
npm install --global --production windows-build-tools
git clone https://github.com/wkdhkr/dedupper.git
cd dedupper
npm install
npm run build
npm link
```
If you use tensorflow-gpu, the following is also required. It performs far better than the CPU version.

Copy the `bin/`, `include/`, and `lib/` folders to `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0`.
Install Miniconda to set up TensorFlow and related packages.

```
choco install miniconda3 -y
```
You can choose either tensorflow or tensorflow-cpu (recommended: tensorflow, which uses the GPU).
NOTE: the `activate` command does not work in PowerShell. Try `conda install -n root -c pscondaenvs pscondaenvs`.
NOTE: when using a yaml file, just install Miniconda; a separate pip/conda install is not needed.
Manual install version:

```
conda create -n tensorflow python=3.6 anaconda
activate tensorflow
pip install tensorflow-gpu
```

or

```
conda create -n tensorflow-cpu python=3.6 anaconda
activate tensorflow-cpu
pip install tensorflow
```

Then install the remaining dependencies:

```
conda install -c anaconda html5lib
pip install opencv-python
conda install -c conda-forge dlib=19.4
```
Yaml install version:

```
conda env create -f tensorflow-cpu.yaml
conda env create -f tensorflow.yaml
```
Use cmder for this setup. You can install it with `choco install cmder -y`.
First, set up rude-carnie.

```
cd
mkdir src
cd src
git clone https://github.com/wkdhkr/rude-carnie.git
cd rude-carnie
wget http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
bunzip2 shape_predictor_68_face_landmarks.dat.bz2
```
Set up the checkpoints folder:

```
checkpoints/
├── age
│   └── inception
│       └── 22801
│           ├── checkpoint
│           ├── checkpoint-14999.data-00000-of-00001
│           ├── checkpoint-14999.index
│           └── checkpoint-14999.meta
└── gender
    └── inception
        └── 21936
            ├── checkpoint
            ├── checkpoint-14999.data-00000-of-00001
            ├── checkpoint-14999.index
            └── checkpoint-14999.meta
```
Next, set up open_nsfw.

```
cd
cd src
git clone https://github.com/wkdhkr/tensorflow-open_nsfw.git
```
That's it! Start the following ps1 script files.

Run install.bat to register the Explorer right-click menu. To uninstall, run uninstall.bat.
```
$ dedupper -h

  Usage: dedupper [options]

  Options:

    -x --db-repair           repair db by log file.
    -C, --no-cache           no use file info cache
    -m, --manual             the current path is registered in the destination.
    -k, --keep               save the file as keeping state
    -r, --relocate           relocate saved file
    -D, --no-dir-keep        no use old dir path for new path
    -R, --no-report          disable report output
    -v, --verbose            show debug log
    -q, --quiet              no prompt window
    -w, --wait               wait on process end
    -l, --log-level [level]  log level
    -L, --no-log-config      no log config
    -P, --no-p-hash          skip p-hash matching
    -p, --path [path]        target file path
    -n, --dryrun             dryrun mode
    -h, --help               output usage information
```
You can customize dedupper's behavior by creating `~/.dedupper.config.js`. Refer to the source code comments for a description of the config; see this. The default config is this. An example config:
```js
const path = require("path");
const { defaultConfig } = require(process.env.USERPROFILE +
  "\\AppData\\Roaming\\npm\\node_modules\\dedupper");

const deepLearningApiConfig = {
  nsfwApi: "http://localhost:5000/image",
  faceDetectWithGenderApi: "http://localhost:5001/face/detect",
  facePredictAgeApi: "http://localhost:5002/face/predict"
};

const deepLearningConfig = {
  ...deepLearningApiConfig,
  instantDelete: false,
  logicalOperation: "or",
  nsfwType: "sfw",
  nsfwMode: "allow",
  // nsfwMode: "none",
  nsfwThreshold: 0.1,
  faceCategories: [
    ["F", "(0, 2)"],
    ["F", "(4, 6)"],
    ["F", "(8, 12)"],
    ["F", "(15, 20)"],
    ["F", "(25, 32)"],
    ["F", "(38, 43)"],
    ["F", "(48, 53)"]
  ],
  faceMode: "allow",
  // faceMode: "none",
  faceMinLongSide: 300
};

const userConfig = {
  archiveExtract: true,
  archiveExtractCommand:
    '"C:\\Program Files (x86)\\LhaForge\\LhaForge.exe" "/cfg:C:\\Program Files (x86)\\LhaForge\\LhaForge.ini" /e',
  // libraryPathHourOffset: 24,
  // libraryPathDate: new Date("2018-03-17"),
  baseLibraryPathByType: {
    ["TYPE_IMAGE"]: "B:\\Image",
    ["TYPE_VIDEO"]: "B:\\Video"
  },
  forceConfig: {
    pHash: false,
    keep: true
  },
  deepLearningConfig,
  pathMatchConfig: {
    [path.join(process.env.USERPROFILE, "Downloads\\")]: {
      maxWorkers: 1,
      pHashIgnoreSameDir: false,
      keep: false
    }
  },
  classifyTypeConfig: {
    TYPE_VIDEO: {
      keep: false, // Override pathMatchConfig
      useFileName: true
    }
  },
  // dbBasePath: path.join(process.env.USERPROFILE, ".dedupper/db_test"),
  logLevel: "trace",
  renameRules: [
    // iv code
    p => {
      const parsedPath = path.parse(p);
      const dirName = path.basename(parsedPath.dir);
      const match = dirName.match(/^\[(.*?)\]/);
      if (match && match[1]) {
        const codeName = match[1];
        if (codeName === parsedPath.name) {
          return parsedPath.dir + parsedPath.ext;
        }
      }
      return p;
    },
    [/\\photo\\/i, "\\"],
    [/\.(mp4|wmv|mkv|png|jpg)\.(mp4|wmv|mkv|png|jpg)/, ".$1"],
    ["src\\dedupper\\", "\\"],
    [/\\[\s ]+/g, "\\"],
    [/[\s ]+\\/g, "\\"],
    [/\\download(s|)\\/gi, "\\"],
    [/\\images\\/gi, "\\"],
    [/\\refs\\/gi, "\\"],
    [/\\root\\/gi, "\\"],
    [/\\new folder[^\\]*\\/g, "\\"],
    [/新しいフォルダ(ー|)( \([0-9]+\)|)/g, ""],
    [/( - copy)+\\/gi, "\\"],
    [/\\\#[0-9]+\\/i, "\\"],
    [
      new RegExp(
        `${["\\\\Users", process.env.USERNAME].join("\\\\")}\\\\`,
        "i"
      ),
      "\\"
    ]
  ],
  ngFileNamePatterns: [
    ".picasa.ini",
    /_cropped\.(jpg|png)/i,
    ".DS_store",
    "Thumbs.db",
    ".BridgeSort"
  ],
  ngDirPathPatterns: [/\\backup\\/i],
  classifyTypeByExtension: defaultConfig.classifyTypeByExtension
};
userConfig.classifyTypeByExtension["txt"] = "TYPE_SCRAP";
module.exports = userConfig;
```
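Note that `renameRules` mixes two entry shapes: a function that receives the path, and a `[pattern, replacement]` pair. A sketch of how such a list could be folded over a path (`applyRenameRules` is a hypothetical helper for illustration, not dedupper's implementation):

```javascript
// Fold a path through rename rules: function entries are called directly,
// [pattern, replacement] pairs are passed to String.prototype.replace.
function applyRenameRules(filePath, rules) {
  return rules.reduce((p, rule) => {
    if (typeof rule === "function") return rule(p);
    const [pattern, replacement] = rule;
    return p.replace(pattern, replacement);
  }, filePath);
}

console.log(
  applyRenameRules("C:\\x\\Downloads\\a.jpg", [[/\\download(s|)\\/gi, "\\"]])
); // C:\x\a.jpg
```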
Dedupper can handle both files and folders; the unit of processing is always a file. Empty folders are deleted.

"Action Type" shows how a file is processed, for example save, erase, replace, or ignore. see this.

"Reason Type" is the reason why each file was assigned its action. see this.

"Classify Type" is assigned per file extension, and how files of each type are processed can be controlled individually. see this.

"File State" is the state of the file: deduplicated, accepted, keep, and so on. see this.
When the same-image judgment does not reach the confidence threshold, Dedupper leaves the decision to the user. The duplicate candidate file becomes a symbolic link, with the reason in its file name, placed in the same folder.
The following folders are created:

- `!replace`
- `!dedupe`
- `!save`
- `!transfer`: like `!replace`, but the destination is a new file path, not a file path that already exists.

Dedupper processes files based on the "mark" given by these folders or by file names. Normal behavior is overridden by these "marks".
You can distribute files to these folders or rewrite the "marks" to make the final decision. When you are satisfied with the "marks", run dedupper again.
Directories and symbolic links that have served their purpose are deleted automatically.
```
aaa.!s.jpg
!replace/aaa.!s.jpg
aaa.!r.jpg
bbb.!r2.png
bbb_x#2.REASON.png (a symbolic link; its destination is marked !2r)
```

In this case only b02_1.!s.jpg is saved; the others are deleted (erase) or deleted after their hash value is recorded (dedupe). Marks like `!s` in the file path are removed when importing.
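A mark such as `!s` sits as a dot-separated token just before the extension. A sketch of extracting and stripping it (the regex and helper names are assumptions for illustration):

```javascript
// Extract a mark token such as "!s" or "!r2" from "name.!mark.ext".
function parseMark(fileName) {
  const m = fileName.match(/\.(![a-z0-9]+)\.[^.]+$/i);
  return m ? m[1] : null;
}

// Remove the mark on import, restoring "name.ext".
function stripMark(fileName) {
  return fileName.replace(/\.![a-z0-9]+(\.[^.]+)$/i, "$1");
}

console.log(parseMark("aaa.!s.jpg")); // !s
console.log(stripMark("aaa.!s.jpg")); // aaa.jpg
```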